1 Introduction


The increasing demand for specular freeform surfaces has presented a major
challenge for production and metrology in recent years. When inspecting 
specular reflective components, such as solar collectors, lens systems, 
wafers, or even telescope mirrors, the interest generally lies in manufactur- 
ing them as precisely as possible, which means that the exact geometric 
object dimensions have to be known. Reflection properties also play a 
crucial role for surfaces from the automotive industry, such as lacquered 
car body parts, or objects from the entertainment industry, since defects 
and flaws in the surface affect the aesthetics of the product to a great 
extent. The inspection of such surfaces is very demanding in practice. 
During the visual inspection of specularly reflective objects, in contrast 
to diffuse reflection, an observer does not see the surface itself, but the 
distorted mirror image of the environment. The reflective surface is vir- 
tually invisible to the observer. Automatic visual inspection, especially 
3D measurement, therefore poses a great metrological challenge. 
Deflectometric measurement methods use the law of reflection and 
knowledge of the arrangement between a camera and a pattern generator, 
e.g., a liquid crystal display (LCD) monitor, to draw conclusions about the 
shape of the surface by means of observing the deformations of the mirror 
images. For automatic visual inspection and accurate 3D reconstruction, 
precise knowledge of the system parameters is required, e.g., the size and 
position of the LCD monitor relative to the camera sensor, as well as the 
intrinsic camera parameters. If the 3D coordinates of a reference object 
point and its mirror reflection are known, the reflection point on the 
specular surface can in principle be calculated from them. However, the 
photographic measurement of an object point lacks distance information, 
that is, only directional information is available. Therefore, even with 
complete knowledge of the system, a single camera is generally not 
sufficient to calculate a unique surface from the measurement data. As 
a result, the solutions for the surface lie on a one-parametric solution
manifold and potential surface normals representing these surfaces can
be calculated for any point in space. To find the true surface from this 
infinite variety, the problem must be regularized, where in principle, 
it would be sufficient to know only a single point. Starting from this 
point, the complete surface can then be reconstructed by integrating the 
deflectometrically measured normal field. 

Obtaining an accurate reconstruction necessitates a precise measure- 
ment. Hence, sophisticated and highly specialized optical imaging de- 
vices are becoming increasingly important for high-precision manufac- 
turing and environment perception. In particular, light field cameras 
are experiencing an ever-increasing interest in research and industry 
as they provide a four-dimensional light field of the scene instead of a 
two-dimensional image. The information captured by a light field camera 
in a single photographic exposure can be used to, for example, digitally 
refocus the image, extract depth information, or subsequently change 
the perspective of the scene. Light field cameras can therefore be re- 
garded as compact 3D cameras. In contrast to camera arrays, in which 
the individual synchronized cameras each sample a part of the light 
field, the hardware requirements for light field cameras are significantly 
reduced, and they are far more robust against external influences. Even 
in a compact handheld camera design, they can capture several hundred
perspectives of the scene in a single shot. 

The advantages of light field cameras should therefore also be made 
accessible to the field of optical metrology. In particular, this thesis aims 
at combining light field imaging with deflectometry as this enables a 
variety of new measurement methods. The additional information com- 
pared to conventional cameras is to be used to regularize the ambiguity 
of the deflectometric measurement while providing a robust reconstruc- 
tion of the surface. While light field cameras offer several advantages 
for deflectometry, they also introduce new difficulties and challenges. 
Thanks to their design, light field cameras have a very high depth of field, 
which improves the lateral resolution of the deflectometric measurement, 
but at the same time, the amount of captured light is reduced, resulting 
in higher noise sensitivity. In addition, the calibration of these cameras 
is, unfortunately, very difficult due to their complex structure and
sophisticated optical design. To achieve the most accurate description of
the imaging process, advanced camera models and elaborate calibration
techniques are required. To make light field cameras available for effec- 
tive use in deflectometry, special focus is placed on four aspects in this 
work: 


Registration: In deflectometry, the measurement of reference coordi- 
nates and thus the registration between camera pixels and points in the 
plane of the reference monitor provides the local slope of the surface. 
The resolution of the angle between the surface point and the
reference point thereby determines the accuracy of the reconstruction.
Therefore, for high precision measurements, the position of the reference 
feature must be determined with subpixel accuracy and has to be robust 
against noise influences. 


Calibration: To enable a highly accurate reconstruction of specular sur- 
faces, precise calibration is of essential importance for deflectometry. To 
triangulate surface points with sufficient accuracy, an intrinsic calibration 
of the camera and the monitor, as well as an extrinsic calibration of the
measurement setup is mandatory. Due to the complex optical design of 
light field cameras, in this work, special emphasis is given to appropriate 
camera calibration. 


Regularization: In order to find an unambiguous solution for the spec- 
ular surface, additional information is needed. Hence, the special prop- 
erties of the light field camera are to be used to resolve the ambiguity of 
the deflectometric measurement and to allow extracting the true surface 
normal from the one-dimensional solution manifold. 


Reconstruction: The regularization provides an initial estimate of the 
surface. As deflectometry is a slope-measuring technique, the surface 
should be obtained by integrating the normal field. The light field camera 
is to be used to enable a robust reconstruction.


1.1 Contributions


The main contributions of this thesis are as follows. 


• Deflectometry requires the coding of the reference monitor pixels.
For this, phase-shift coding is used and different approaches for 
phase unwrapping are investigated. Since many approaches ne- 
glect the periodicity of the phase, modifications are proposed to 
improve the performance of existing unwrapping methods by us- 
ing a circular mean operation and circular distances. Furthermore, 
a new probabilistic approach for temporal phase unwrapping is 
developed that uses circular statistics to model the multi-frequency 
phase-shift coding and enables an optimal reconstruction of the 
phase. The developed method respects the periodicity of the phase, 
simultaneously unwraps all phase measurements using maximum- 
likelihood estimation, allows for an easy frequency selection with a 
maximum uniqueness range of the unwrapping, and additionally, 
includes the estimation of the phase uncertainty into the overall 
unwrapping process. Moreover, the method is extended by con- 
sidering the local pixel neighborhood resulting in a probabilistic 
approach for spatio-temporal phase unwrapping that outperforms 
state-of-the-art methods. 


• Since light field cameras have a complex optical design, more
sophisticated camera models are necessary. This work proposes to
use a generic camera model, and a new approach for its calibration 
is developed in which the uncertainty of the calibration features is 
taken into account during optimization, leading to increased accu- 
racy of the overall camera calibration. The problem is divided into 
two subproblems, a camera ray calibration and a reference target 
pose estimation, and to make the optimization feasible, alternat- 
ing minimization is applied. Further, a closed-form least-squares 
solution for the ray calibration subproblem is presented, and the 
pose estimation subproblem is efficiently solved using a gradi- 
ent descent optimization on the rotation manifold. All of this is 
achieved by minimizing a single objective function, where convergence
is guaranteed. In addition, acceleration techniques are applied
to obtain an almost quadratic convergence rate. Experimental
evaluations show that the proposed method outperforms standard
calibration techniques and other generic approaches and yields a 
high-precision calibration. 


• The reference monitor is modeled by a polynomial shape model
and a refraction model, and it is shown how the model and the 
estimation of its parameters can be efficiently integrated into the 
generic calibration framework, which further increases the calibra- 
tion accuracy. 


• A new approach for light field reconstruction is developed using
the generic camera calibration as a basis. The approach is com- 
pletely generic and can be used to reconstruct light fields from 
arbitrary light field imaging systems, independent of whether the 
camera is based on microlenses, mirrors, or coded apertures, or 
whether it is realized by employing a camera array. Despite being 
estimated from a generic and unconstrained set of camera rays, 
the method outperforms state-of-the-art light field calibration ap- 
proaches and yields rectified images with an accurate intrinsic 
calibration. 


• Different approaches are developed to use the light field camera for
the regularization of the deflectometric normal measurement. For 
partially specular surfaces, a classical light field depth estimation 
approach is used to obtain an initial estimate of the surface height. 
Furthermore, an approach is presented that estimates the distance 
to the reflected reference monitor and uses this to calculate the 
distance to the surface. For the reconstruction of fully specular 
freeform surfaces, a stereo deflectometry approach is adapted to 
implement a light field-based multi-view-deflectometry approach 
that allows triangulation of the surface.


• For the reconstruction of specular surfaces, a method is developed
that fuses the depth estimates obtained through regularization 
with the deflectometrically measured surface normals. For this 
purpose, a variational fusion approach is adapted to account for 
the multi-view property of the light field camera, resulting in an 
improved reconstruction accuracy.


Some parts of this work have already been published elsewhere:


• The analysis of phase-shift coding and the methods for phase
unwrapping from chapter 4 were published in [A7, A2, A1].


• The uncertainty-based calibration approach for the generic camera
model from Sec. 5.2 and parts of the evaluation from Sec. 5.5 were 
published in [A5]. 


• Contributions to the generic light field reconstruction, as presented
in chapter 6, were published in [A4, A3]. 


• Some aspects of the light field-based regularization techniques and
the fusion of depth and normal estimates for surface reconstruction, 
as presented in chapter 7, were published in [A9, A6]. 


In contrast to these original publications, the content of this thesis has
been revised considerably and the evaluation is much more detailed. In
particular, rather than examining only individual aspects in isolation, the
interaction of the individual components is investigated, and the influence
of the entire processing chain on the performance of the final result
is examined.


1.2 Overview 


The remainder of this thesis is structured as follows. Starting with chap- 
ter 2, basic mathematical concepts used throughout this work are pre- 
sented. Chapter 3 provides the theory of light fields, light field imaging, 
and its applications. Furthermore, it explains the working principles of 
deflectometry, presents its difficulties, and formulates the steps required 
for light field-based specular surface reconstruction. 

In chapter 4 the first step of the deflectometric measurement pipeline is 
analyzed, i.e., the registration of camera pixels with features on a reference 
monitor. The principles of phase-shift coding are introduced, the state of 
the art in the field of temporal phase unwrapping is reviewed, suggestions 
for improvement are made, and as the main content, a new probabilistic 
approach to spatio-temporal phase unwrapping is presented. The chapter
concludes with an extensive comparison of the proposed methods with
state-of-the-art methods. 

Chapter 5 presents the calibration of the entire deflectometric mea- 
surement setup. Starting from the basic principles of camera calibration, 
the motivation behind the use of more advanced generic camera models 
is introduced, and an alternating minimization-based approach for cali- 
bration is presented, taking into account the uncertainty of calibration 
features that were obtained through phase-shift coding. After this, the 
modeling of the reference monitor is explained, and it is demonstrated 
how estimating its parameters can be integrated into the generic calibra- 
tion framework. Furthermore, it is shown how the generic camera model 
can be utilized to perform the extrinsic calibration of the deflectometric 
measurement setup. 

Subsequently, in chapter 6, the results of the generic camera calibra- 
tion are reused to decode light fields from raw camera data. With this, 
the inherent 4D topological ray-space of the light field is reconstructed, 
preserving both the information of the observed scene and the geomet- 
ric structure of the light field by adequate rectification and calibration. 
Further, different resampling strategies are discussed and the proposed 
method is compared to state-of-the-art light field calibration methods. 

Eventually, chapter 7 demonstrates how light field cameras can be effi- 
ciently combined with deflectometry. Possibilities for a light field-based 
regularization are proposed, which can solve the ambiguity of the surface 
normal estimation. A variational surface reconstruction approach is pre- 
sented, which fuses the regularization points with the deflectometrically 
measured surface normals and enables high-precision reconstruction. 
Furthermore, different surfaces are investigated and several aspects of the
entire deflectometric measurement chain are examined for their influence 
on the surface reconstruction. 

Finally, chapter 8 summarizes the presented work and draws conclu- 
sions, providing further insights into future research possibilities. 


2 Preliminaries 


This chapter introduces the basic mathematical principles used in this 
work. These include useful operators, the mathematical parameterization 
of rotations, lines, and surfaces in 3D space, and optimization techniques. 
The purpose of this chapter is to provide a general list of tools needed 
for this work. All information is gathered here to avoid impeding the 
flow of reading in later chapters. The following is therefore primarily 
intended as a reference. 


2.1 Operators 


Reshape Operators 
The vec-operator vectorizes a matrix by stacking its columns: 


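\[ \operatorname{vec}(B) = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_M \end{pmatrix}, \quad \text{with } B = (b_1, b_2, \ldots, b_M), \tag{2.1} \]

where $B \in \mathbb{R}^{N \times M}$, $b_m \in \mathbb{R}^N$, and $\operatorname{vec}(B) \in \mathbb{R}^{NM}$.
The mat-operator is the inverse of the vec-operator, and reshapes a
vectorized matrix back to its original form:


\[ \operatorname{mat}(b) := B, \quad \text{where } b = \operatorname{vec}(B). \tag{2.2} \]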


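The vec-operator is compatible with the Kronecker product $\otimes$. With
$A \in \mathbb{R}^{K \times N}$, $B \in \mathbb{R}^{N \times M}$, $C \in \mathbb{R}^{M \times L}$, a useful equation can be
derived [63]:


\[ \operatorname{vec}(ABC) = (C^T \otimes A) \operatorname{vec}(B), \tag{2.3} \]


where $(C^T \otimes A) \in \mathbb{R}^{KL \times NM}$ and $\operatorname{vec}(B) \in \mathbb{R}^{NM}$.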
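
The identity (2.3) can be checked numerically. The following is a minimal sketch using NumPy; the helper names `vec` and `mat` and the matrix sizes are chosen here for illustration only and are not taken from the thesis. Column stacking corresponds to column-major (Fortran) ordering in NumPy.

```python
import numpy as np

def vec(B):
    """Stack the columns of B into one vector (column-major order)."""
    return B.reshape(-1, order="F")

def mat(b, shape):
    """Inverse of vec: reshape a stacked vector back to a matrix of the given shape."""
    return b.reshape(shape, order="F")

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))   # K x N (sizes are arbitrary illustration values)
B = rng.standard_normal((3, 4))   # N x M
C = rng.standard_normal((4, 5))   # M x L

lhs = vec(A @ B @ C)              # vector of length K*L
rhs = np.kron(C.T, A) @ vec(B)    # (K*L x N*M) matrix times a vector of length N*M
identity_holds = np.allclose(lhs, rhs)
roundtrip_holds = np.array_equal(mat(vec(B), B.shape), B)
```

The identity is mainly useful because it turns a matrix equation that is linear in $B$ into an ordinary linear system in $\operatorname{vec}(B)$.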


Skew-Operator


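The cross-product between vectors can be formulated using the skew-operator
$[\cdot]_\times$. For $\xi, u \in \mathbb{R}^3$ the skew-operator is defined as follows:


\[ [\xi]_\times := \begin{pmatrix} 0 & -\xi_3 & \xi_2 \\ \xi_3 & 0 & -\xi_1 \\ -\xi_2 & \xi_1 & 0 \end{pmatrix}. \tag{2.4} \]

With it, the useful relations


\[ \xi \times u = [\xi]_\times u = [u]_\times^T \xi = -u \times \xi \tag{2.5} \]


can be formulated. Applying the vec-operator on the skew-operator can
be formulated as a matrix-vector product:


\[ \operatorname{vec}([\xi]_\times) = Z \xi, \tag{2.6} \]
\[ Z = \left[ \operatorname{vec}([e_1]_\times), \operatorname{vec}([e_2]_\times), \operatorname{vec}([e_3]_\times) \right], \tag{2.7} \]


with the unit basis vectors $e_1, e_2, e_3$.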
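
As a brief illustration, the relations (2.5) to (2.7) can be verified with a few lines of NumPy. This is only a sketch; the function name `skew` and the test vectors are assumptions made for the example.

```python
import numpy as np

def skew(xi):
    """Return the 3x3 skew-symmetric matrix [xi]_x, so that skew(xi) @ u == cross(xi, u)."""
    x1, x2, x3 = xi
    return np.array([[0.0, -x3,  x2],
                     [ x3, 0.0, -x1],
                     [-x2,  x1, 0.0]])

# Z maps xi to vec([xi]_x); its columns are the vectorized skew matrices of e_1, e_2, e_3.
Z = np.column_stack([skew(e).reshape(-1, order="F") for e in np.eye(3)])

xi = np.array([0.3, -1.2, 0.7])   # arbitrary example vectors
u = np.array([2.0, 0.5, -1.0])

cross_ok = np.allclose(skew(xi) @ u, np.cross(xi, u))
transpose_ok = np.allclose(skew(u).T @ xi, np.cross(xi, u))
vec_ok = np.allclose(Z @ xi, skew(xi).reshape(-1, order="F"))
```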


Directional Derivative 


Let $\mathcal{M}$ be a smooth submanifold of a Euclidean space and $p$ a point of $\mathcal{M}$.
Let $f$ be a function defined in a neighborhood of $p$ that is differentiable
at $p$. With the tangent vector $\xi$ to $\mathcal{M}$ at $p$, the directional derivative of
$f$ along $\xi$ can be defined. Given a curve $\gamma$ on $\mathcal{M}$ with $\gamma(0) = p$ and
$\dot{\gamma}(0) = \xi$, the directional derivative is defined by [1]


\[ D_\xi f(p) = \partial_\varepsilon f(\gamma(\varepsilon)) \big|_{\varepsilon = 0}. \tag{2.8} \]
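
To make the definition concrete, the sketch below evaluates (2.8) by finite differences on a simple example manifold, the unit sphere, with a great circle as the curve. The choice of manifold, the linear test function, and all names are assumptions made purely for illustration.

```python
import numpy as np

def curve(p, xi, eps):
    """Great circle on the unit sphere with curve(0) = p and velocity xi at eps = 0."""
    speed = np.linalg.norm(xi)
    return np.cos(eps * speed) * p + np.sin(eps * speed) * xi / speed

def directional_derivative(f, p, xi, h=1e-6):
    """Central finite-difference approximation of d/d(eps) f(curve(eps)) at eps = 0."""
    return (f(curve(p, xi, h)) - f(curve(p, xi, -h))) / (2.0 * h)

a = np.array([1.0, 2.0, 3.0])

def f(x):
    """Example function, linear in x."""
    return a @ x

p = np.array([0.0, 0.0, 1.0])     # point on the sphere
xi = np.array([0.5, 0.25, 0.0])   # tangent vector at p (orthogonal to p)

numeric = directional_derivative(f, p, xi)
analytic = a @ xi                 # for a linear f the derivative along xi is <a, xi>
```

For this linear example the result depends only on $\xi$ and not on the particular curve, which is the property that makes the definition well posed.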


2.2 Rotation Parametrization 


Rotations in 3D space have three degrees of freedom. There are different 
parametrizations, which contain more or less redundant information, 
and which are subject to more or less constraints [188]. Depending on the 
application for which a mathematical description of rotations is required, 
different parametrizations can be advantageous. In this work, rotations 
are represented throughout by rotation matrices.


Rotation Matrix 


Rotation matrices are elements of the special orthogonal group in three
dimensions, $R \in SO(3) \subset \mathbb{R}^{3 \times 3}$, which are subject to several constraints:


\[ SO(3) = \{ R \in \mathbb{R}^{3 \times 3} \mid R^T R = I, \ \det(R) = 1 \}. \tag{2.9} \]


A rotation matrix is described with nine parameters


\[ R = \begin{pmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{pmatrix} = (r_1, r_2, r_3), \tag{2.10} \]


where $\|r_n\| = 1$ for $n = 1, 2, 3$. The transpose of a rotation matrix is its
inverse, $R^T = R^{-1}$, and the column vectors $r_1, r_2, r_3$ span the coordinate
system that the rotation matrix transforms to.


Local Parametrization of SO(3) 


Rotation matrices are very intuitive because the rotation of a 3D point 
can be realized by simple matrix-vector multiplication. If rotations are 
needed for parameter estimation or optimization, the highly redundant 
rotation matrices are only of limited use due to the many constraints. If 
used in optimization, derivatives often have to be calculated. However, 
the simple calculation of derivatives of individual parameters does not 
lead to meaningful results, since rotation matrices are defined on the 
Riemannian manifold SO(3) and this property is lost if not handled 
correctly [24]. Derivatives must therefore be calculated directly on the 
manifold [81, 138, 176]. 

The smooth and differentiable Riemannian manifold SO(3) is a finite- 
dimensional Lie group [189]. Every matrix Lie group is associated with 
a Lie algebra. The corresponding Lie algebra so(3) is the set of all 3 x 3 
skew-symmetric matrices 


\[ \mathfrak{so}(3) = \{ \Omega = [\xi]_\times \in \mathbb{R}^{3 \times 3} \mid \xi \in \mathbb{R}^3 \}, \tag{2.11} \]


which is the tangent space of the Lie group at the identity element [24].
The mapping from any element $[\xi]_\times \in \mathfrak{so}(3)$ to $R \in SO(3)$ is called
the exponential map $R = \operatorname{Exp}([\xi]_\times)$ and is defined using the standard
matrix exponential series. It can be calculated in closed form using the
well known Rodrigues rotation formula [125]:


\[ \operatorname{Exp}([\xi]_\times) = e^{[\xi]_\times} = I + \frac{\sin(\|\xi\|)}{\|\xi\|} [\xi]_\times + \frac{1 - \cos(\|\xi\|)}{\|\xi\|^2} [\xi]_\times^2. \tag{2.12} \]


The reverse map from the Lie group to the Lie algebra $[\xi]_\times = \operatorname{Log}(R)$ is
called the logarithmic map [125]:


\[ \operatorname{Log}(R) = \frac{\theta (R - R^T)}{2 \sin(\theta)}, \quad \text{with } \theta = \cos^{-1}\!\left( \frac{\operatorname{tr}(R) - 1}{2} \right). \tag{2.13} \]


Therefore, one can find a smooth parametrization $g_R(\xi) = \operatorname{Exp}([\xi]_\times) R$
of the $SO(3)$ manifold in a local neighborhood of $R$, which is differentiable
with respect to $\xi \in \mathbb{R}^3$ using the tangent space.
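
The exponential and logarithmic maps can be sketched in a few lines of NumPy. This is an illustrative sketch, not the implementation used in this work: the function names are hypothetical, and the small-angle branches (Taylor expansion near zero) are an added numerical safeguard that is not part of the formulas above. The logarithm as written is only valid for rotation angles below $\pi$.

```python
import numpy as np

def skew(xi):
    """Skew-symmetric matrix [xi]_x of a 3-vector."""
    x1, x2, x3 = xi
    return np.array([[0.0, -x3, x2], [x3, 0.0, -x1], [-x2, x1, 0.0]])

def exp_so3(xi):
    """Rodrigues formula: map xi in R^3 (via [xi]_x) to a rotation matrix."""
    theta = np.linalg.norm(xi)
    S = skew(xi)
    if theta < 1e-8:
        # Assumption: second-order Taylor expansion avoids 0/0 for tiny rotations.
        return np.eye(3) + S + 0.5 * S @ S
    return (np.eye(3)
            + np.sin(theta) / theta * S
            + (1.0 - np.cos(theta)) / theta**2 * S @ S)

def log_so3(R):
    """Logarithmic map: return xi with [xi]_x = Log(R), for angles below pi."""
    cos_theta = np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0)
    theta = np.arccos(cos_theta)
    if theta < 1e-8:
        W = 0.5 * (R - R.T)   # limit of theta / (2 sin(theta)) for theta -> 0
    else:
        W = theta / (2.0 * np.sin(theta)) * (R - R.T)
    return np.array([W[2, 1], W[0, 2], W[1, 0]])

def local_param(R, xi):
    """Local parametrization around R: Exp([xi]_x) R."""
    return exp_so3(xi) @ R

xi = np.array([0.2, -0.4, 0.1])   # arbitrary example rotation vector
R = exp_so3(xi)
```

In an optimization, an update step would typically be computed for $\xi$ in the tangent space and then applied with `local_param`, so that the iterate stays on the manifold by construction.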


2.3 Line Parametrization and Plücker Coordinates


In 6D-Plücker-space a Plücker-line $l \in \mathbb{P}^6$ is defined by its direction
vector $d \in \mathbb{R}^3$ and its moment vector $m \in \mathbb{R}^3$ [186, 207]. A line in 3D-space
has four degrees of freedom, therefore two constraints apply to
the Plücker-line:


\[ \mathbb{P}^6 = \left\{ \begin{pmatrix} d \\ m \end{pmatrix} \,\middle|\, d, m \in \mathbb{R}^3, \ d^T m = 0, \ \|d\| = 1 \right\}. \tag{2.14} \]


The moment vector can be calculated with $m = p_l \times d$, where $p_l \in \mathbb{R}^3$ is
an arbitrary point on the line $l$, see figure 2.1. The moment vector is
perpendicular to the line and its norm $\|m\|$ corresponds to the Euclidean
distance of the line to the origin. Given two points $p_1, p_2 \in \mathbb{R}^3$, the
Plücker-line $l = (d^T, m^T)^T$ traversing both points can be calculated:


\[ d = \frac{p_1 - p_2}{\|p_1 - p_2\|}, \tag{2.15} \]
\[ m = p_1 \times d = p_2 \times d. \tag{2.16} \]


Figure 2.1 A Plücker-line $l$ is defined by its direction vector $d$ and moment vector $m$.
It can be calculated from two points $p_1$ and $p_2$ on the line.


A rotation and translation of the line in 3D-space is achieved using simple
matrix operations [12]:


\[ l' = \begin{pmatrix} R & 0 \\ 0 & R \end{pmatrix} l, \tag{2.17} \]
\[ l' = \begin{pmatrix} I & 0 \\ [t]_\times & I \end{pmatrix} l, \tag{2.18} \]


where $R \in SO(3)$, $t \in \mathbb{R}^3$ and $[\cdot]_\times$ are the rotation matrix, the translation
vector and the skew operator, respectively. The Euclidean distance $d(l, p)$
of a line $l$ to an arbitrary point $p \in \mathbb{R}^3$ is defined as the distance to
the closest point on the line. It is found by translating the origin of the
coordinate system into the point $p$


\[ l' = \begin{pmatrix} d \\ m - p \times d \end{pmatrix} \tag{2.19} \]


and by calculating the distance between the translated line and the new
origin:
\[ d(l, p) = d(l', 0) = \|m'\| = \|p \times d - m\|. \tag{2.20} \]
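
The Plücker operations of this section can be sketched as follows. This is a minimal NumPy example under the conventions stated above (a line stored as the stacked 6-vector of $d$ and $m$); the function names and the example points are assumptions for illustration.

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix [v]_x of a 3-vector."""
    return np.array([[0.0, -v[2], v[1]], [v[2], 0.0, -v[0]], [-v[1], v[0], 0.0]])

def line_from_points(p1, p2):
    """Pluecker line (d, m) through p1 and p2, stacked as a 6-vector."""
    d = (p1 - p2) / np.linalg.norm(p1 - p2)
    m = np.cross(p1, d)
    return np.concatenate([d, m])

def rotate_line(l, R):
    """Rotate a line: direction and moment are both rotated by R."""
    T = np.block([[R, np.zeros((3, 3))], [np.zeros((3, 3)), R]])
    return T @ l

def translate_line(l, t):
    """Translate a line by t: the moment changes by t x d, the direction is kept."""
    T = np.block([[np.eye(3), np.zeros((3, 3))], [skew(t), np.eye(3)]])
    return T @ l

def point_line_distance(l, p):
    """Euclidean distance between point p and line l."""
    d, m = l[:3], l[3:]
    return np.linalg.norm(np.cross(p, d) - m)

p1 = np.array([1.0, 0.0, 2.0])   # example points on a line parallel to the y-axis
p2 = np.array([1.0, 3.0, 2.0])
l = line_from_points(p1, p2)
t = np.array([0.0, 0.0, 1.0])
Rz = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])  # 90 degrees about z
```

One practical advantage of this representation is that the point-to-line distance needs no explicit foot point, only a cross product and a norm.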


2.4 Surface Parametrization


Surfaces are represented in this work by point clouds or, when needed,
by a two-dimensional function


\[ z : \mathbb{R}^2 \to \mathbb{R}, \quad (s, t) \mapsto z(s, t), \tag{2.21} \]


where the set of s, t values defines the topological relationship (e.g., be- 
tween camera pixels), and z(s,t) represents the corresponding depth 
or height value. The surface is hereby defined by discrete values or a 
continuous implicit parametric description. 


Relation between Surface Normal and Gradient 


Let $s, t$ be the image coordinates of a surface, $n(s, t)$ the corresponding
surface normal, and $\mathbf{x}(s, t) = (x(s, t), y(s, t), z(s, t))^T$ a surface point.
Calculating the surface gradient with respect to the image coordinates
now depends on the model of projection [163].

With an orthographic projection a 3D point is projected orthogonally 
onto the image plane. The image coordinates equal the point coordinates: 


x(s,t)=s, (2.22) 
y(s,t) =t. (2.23) 


The cross product of the 3D point’s partial derivatives is normal to the
surface:
\[ \partial_s \mathbf{x} \times \partial_t \mathbf{x} \sim n. \tag{2.24} \]


By normalizing this, and choosing the sign so that $n$ points toward the
camera, one obtains


\[ n = \frac{1}{\sqrt{1 + \|\nabla z\|^2}} \begin{pmatrix} \nabla z \\ -1 \end{pmatrix}, \tag{2.25} \]


where $\nabla z = (\partial_s z, \partial_t z)^T$ denotes the gradient of the depth map $z$. Solving
(2.25) for the surface gradient then yields


\[ \nabla z = \begin{pmatrix} \partial_s z \\ \partial_t z \end{pmatrix} = -\frac{1}{n_3} \begin{pmatrix} n_1 \\ n_2 \end{pmatrix}. \tag{2.26} \]


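With a perspective projection, the projected coordinates now depend
on the depth and the focal length $f$ of the used camera [163]:


\[ x(s, t) = \frac{s}{f} z(s, t), \tag{2.27} \]
\[ y(s, t) = \frac{t}{f} z(s, t). \tag{2.28} \]


The cross product of the 3D point’s partial derivatives is normal to the
surface and parallel to the normal vector, implying $(\partial_s \mathbf{x} \times \partial_t \mathbf{x}) \times n = 0$,
which results in the equation system


\[ \begin{aligned}
0 &= f n_3 \partial_s z + n_1 \left[ z + s \partial_s z + t \partial_t z \right], \\
0 &= f n_3 \partial_t z + n_2 \left[ z + s \partial_s z + t \partial_t z \right], \\
0 &= n_2 \partial_s z - n_1 \partial_t z.
\end{aligned} \tag{2.29} \]


Knowing $z > 0$ holds for the depth map and substituting $\tilde{z} = \ln(z)$
makes (2.29) linear in the partial derivatives $\partial_s \tilde{z}$ and $\partial_t \tilde{z}$:

\[ \begin{aligned}
0 &= \left[ n_3 f + n_1 s \right] \partial_s \tilde{z} + n_1 t \, \partial_t \tilde{z} + n_1, \\
0 &= \left[ n_3 f + n_2 t \right] \partial_t \tilde{z} + n_2 s \, \partial_s \tilde{z} + n_2, \\
0 &= n_2 \partial_s \tilde{z} - n_1 \partial_t \tilde{z}.
\end{aligned} \tag{2.30} \]


This can then be easily inverted, providing a formula for the surface
gradient of the substitute depth map $\tilde{z}$:


\[ g := \nabla \tilde{z} = \begin{pmatrix} \partial_s \tilde{z} \\ \partial_t \tilde{z} \end{pmatrix} = -\frac{1}{s n_1 + t n_2 + f n_3} \begin{pmatrix} n_1 \\ n_2 \end{pmatrix}. \tag{2.31} \]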
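
The two normal-to-gradient relations can be sketched and sanity-checked numerically. The code below is only an illustration under the conventions of this section: the function names are hypothetical, and the tilted test plane with its parameter values is an assumption chosen so that the depth stays positive.

```python
import numpy as np

def normal_orthographic(grad):
    """Eq. (2.25): unit normal facing the camera from the depth gradient."""
    n = np.array([grad[0], grad[1], -1.0])
    return n / np.linalg.norm(n)

def gradient_orthographic(n):
    """Eq. (2.26): depth gradient (dz/ds, dz/dt) from a surface normal n."""
    return -n[:2] / n[2]

def log_depth_gradient_perspective(n, s, t, f):
    """Eq. (2.31): gradient of the log-depth ln(z) at image point (s, t)."""
    return -n[:2] / (s * n[0] + t * n[1] + f * n[2])

# Orthographic round trip: gradient -> normal -> gradient.
grad = np.array([0.3, -0.5])
grad_back = gradient_orthographic(normal_orthographic(grad))

# Perspective check on a tilted plane a*x + b*y + c*z = d (illustration values).
a, b, c, d, f = 0.1, -0.2, -1.0, -500.0, 50.0

def plane_depth(s, t):
    """Depth of the plane along the pixel ray, using x = s*z/f and y = t*z/f."""
    return d * f / (a * s + b * t + c * f)

s0, t0, h = 3.0, -2.0, 1e-4
numeric = np.array([
    (np.log(plane_depth(s0 + h, t0)) - np.log(plane_depth(s0 - h, t0))) / (2 * h),
    (np.log(plane_depth(s0, t0 + h)) - np.log(plane_depth(s0, t0 - h))) / (2 * h),
])
n_plane = np.array([a, b, c]) / np.linalg.norm([a, b, c])
analytic = log_depth_gradient_perspective(n_plane, s0, t0, f)
```

Note that (2.31) is invariant to the scale of $n$, so an unnormalized normal gives the same gradient.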


2.5 Primal-Dual Optimization 


For variational optimization, often the primal-dual formalism is used to 
find efficient optimization algorithms that allow for a smooth minimiza- 
tion of non-smooth functions [34]. 

Let $X, Y$ be two finite-dimensional real vector spaces and let the general
optimization problem be of the form


\[ \min_{x \in X} F(Kx) + G(x), \tag{2.32} \]

where $K : X \to Y$ is a continuous linear operator and $F : Y \to \mathbb{R}^+$, $G : X \to \mathbb{R}^+$
are convex functions, while $F$ can be discontinuous. The primal-dual
formulation of this is the convex-concave saddle point problem [25]


\[ \min_{x \in X} \max_{y \in Y} \ \langle Kx, y \rangle + G(x) - F^*(y), \tag{2.33} \]

where $\langle \cdot, \cdot \rangle$ is an inner product, $x$ is considered the primal variable, $y$ the
dual variable, and $F^*$ denotes the convex conjugate of the function $F$:


\[ F^*(y) = \sup_{x \in Y} \left\{ \langle y, x \rangle - F(x) \right\}. \tag{2.34} \]

Independent of the convexity of $F$, the convex conjugate is always a
convex function. The saddle point optimization problem can be efficiently
solved in an alternating manner using primal-dual algorithms [34]:


\[ y^{(n+1)} = \operatorname{prox}_{\sigma F^*}\left( y^{(n)} + \sigma K \bar{x}^{(n)} \right), \tag{2.35} \]
\[ x^{(n+1)} = \operatorname{prox}_{\tau G}\left( x^{(n)} - \tau K^* y^{(n+1)} \right), \tag{2.36} \]
\[ \bar{x}^{(n+1)} = x^{(n+1)} + \theta \left( x^{(n+1)} - x^{(n)} \right), \tag{2.37} \]


where $\tau, \sigma, \theta$ are parameters and $K^*$ is the adjoint of the operator $K$. The
primal variable $x$ is updated in each iteration $n$ with a proximal descent,
the dual variable $y$ is updated with a proximal ascent, and a final
extrapolation step increases the convergence rate. The proximal operators
can be formulated through optimization of an independent subproblem:


\[ \operatorname{prox}_{\tau G}(x) = \arg\min_{x'} \left\{ \frac{\|x - x'\|^2}{2\tau} + G(x') \right\}. \tag{2.38} \]

The advantage of the primal-dual algorithms is that the difficult optimization
problem (2.32) can be solved iteratively by evaluating only the proximal
operators, for which in many cases an analytic solution of the
subproblems can be provided [146].
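
As an illustration of the iteration (2.35) to (2.37), the sketch below applies it to a small example problem that is not taken from this work: one-dimensional total-variation denoising, with $F(Kx) = \lambda \|Kx\|_1$, $K$ a forward-difference operator, and $G(x) = \tfrac{1}{2} \|x - b\|^2$. For this choice both proximal operators have closed forms (a clipping and a weighted average). The step sizes, the iteration count, and all names are assumptions made for the example.

```python
import numpy as np

def forward_diff(x):
    """K: finite differences, (Kx)_i = x_{i+1} - x_i."""
    return x[1:] - x[:-1]

def forward_diff_adjoint(y):
    """K*: adjoint of forward_diff, so that <Kx, y> == <x, K*y>."""
    out = np.zeros(y.size + 1)
    out[:-1] -= y
    out[1:] += y
    return out

def tv_denoise(b, lam, iterations=500):
    """Minimize lam * ||Kx||_1 + 0.5 * ||x - b||^2 with the primal-dual iteration."""
    tau = sigma = 0.49   # chosen so that tau * sigma * ||K||^2 < 1, using ||K||^2 <= 4
    theta = 1.0
    x = b.copy()
    x_bar = b.copy()
    y = np.zeros(b.size - 1)
    for _ in range(iterations):
        # Dual ascent: the prox of sigma * F* is the projection onto [-lam, lam].
        y = np.clip(y + sigma * forward_diff(x_bar), -lam, lam)
        # Primal descent: the prox of tau * G has the closed form below.
        x_new = (x - tau * forward_diff_adjoint(y) + tau * b) / (1.0 + tau)
        # Extrapolation step.
        x_bar = x_new + theta * (x_new - x)
        x = x_new
    return x

def objective(x, b, lam):
    """Value of the primal objective (2.32) for the example problem."""
    return lam * np.abs(forward_diff(x)).sum() + 0.5 * np.sum((x - b) ** 2)

rng = np.random.default_rng(1)
clean = np.concatenate([np.zeros(50), np.ones(50)])      # step signal
noisy = clean + 0.1 * rng.standard_normal(clean.size)    # synthetic noise
denoised = tv_denoise(noisy, lam=0.5)
```

The non-smooth $\ell_1$ term never has to be differentiated: it only enters through the projection in the dual update, which is the point of the primal-dual formulation.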


3.1 Deflectometry


Specular surfaces can be found in numerous areas of industrial produc- 
tion. For instance, they appear in lacquered body parts of the automotive 
industry, in entertainment products, in glazed ceramics, or in the pro- 
duction of high-precision mirror optics, such as those used in telescopes. 
Depending on the degree of specularity, an observation will reveal im- 
age features composed of a superposition of direct surface features and 
features from the reflected image of the environment. Obtaining 3D infor- 
mation about physical objects is a significant application of automated 
visual inspection methods. However, many of these methods fail when 
examining fully specular objects, especially triangulation-based methods 
such as stereo vision or fringe projection profilometry. The reason for 
this is that, in contrast to diffuse reflection, an observer does not see the 
surface itself, but the distorted mirror image of the surroundings. The 
specular surface is practically invisible to the observer. Automatic visual 
inspection, especially 3D measurement, therefore is a major challenge. 
A human observer can intuitively make assumptions about the
surface by watching the distortion. Various computer vision techniques
try to imitate this principle, e.g., shape from specular reflection and shape from
distortion [11]. A certain subclass of these are the so-called deflectometric
methods. The measurement setup consists here of a camera and an active 
illumination source, e.g., a commercially available monitor. By illuminat- 
ing with a known reference pattern, information about the surface can be 
obtained from the observed distortions. In detail, deflectometry makes it 
possible to obtain highly precise slope information of the surface, which 
can be used for 3D reconstruction or defect detection. The advantages of
deflectometry are that it is very robust, that it can be realized with
inexpensive hardware, and that the measurement sensitivity is limited
geometrically by the resolution of the camera sensor and the extent of the
measurement setup. This makes it interesting for many industrial applications.


Figure 3.1 Deflectometric measurement principle: The camera observes distorted
reference patterns as reflection in the surface. Knowing the reference point, a surface normal
can be calculated for every point on a camera ray.


3.1.1 Measurement Principle 


The most basic experimental setup for a deflectometric measurement is 
illustrated in figure 3.1. It consists of three components: an illumination 
source displaying structured patterns, a specular object under test, and a 
camera. For the light source, standard LCD monitors are usually used 
that can be actively controlled, or reference patterns are projected onto a 
canvas by means of a projector. The reference shows a pattern or a series 
of patterns, which are then reflected on the examined specular object. 
In deflectometry, the specular surface itself is part of the system and is 
located in the optical path between the illumination and the camera. The 
reference pattern is therefore distorted by the curvature of the surface, 
and the resulting warped pattern can be imaged with a conventional 
digital camera. Assuming each ray is reflected only once, which is true
for many technical surfaces, and since the object is specularly reflective,
a camera pixel sees either exactly one position on the screen or none. 

In a camera-fixed coordinate system, starting from a camera pixel, a
vision ray can be constructed with direction $\hat{s} \in S^2$ starting from the optical
center of the camera. The ray hits the surface in the point $s = \rho \hat{s}$, with
$\|\hat{s}\| = 1$. At the surface, the ray is reflected, hits the reference monitor,
and observes a feature $x(\hat{s})$ in the monitor plane. If in a fully calibrated
system the transformation of monitor coordinates to the camera coordinate
system is known, the position of the observed monitor feature
relative to the camera can be calculated:


\[ p = R x + t. \tag{3.1} \]


Using the law of reflection, the surface normal $n$ of the observed point
can then be specified as the angle bisector between the camera ray $\hat{s}$ and
the reflected ray $\hat{s}_r$:
\[ n(\rho) = \hat{s}_r - \hat{s} = \frac{p - s}{\|p - s\|} - \frac{s}{\|s\|} = \frac{p - \rho \hat{s}}{\|p - \rho \hat{s}\|} - \hat{s}. \tag{3.2} \]
The integration of the normal field of all camera pixels finally yields the 
reconstruction of the investigated surface. 

However, a problem arises here, because in general, the length $\rho$ of the
vector $s$ is unknown in (3.2). This means that a one-parametric set of
hypothetical surface normals can be calculated for each camera ray, which
in turn leads to an ambiguity of the surface estimation. More precisely, a 
surface normal can only be calculated correctly if the corresponding sur- 
face point is already known, and the surface can only be reconstructed if 
the surface normals are provided. To resolve the ambiguity of the deflec- 
tometric measurement, additional regularizing information is required. 
In principle, it would suffice to measure only one point of the surface 
and to reconstruct the surface from the normal field starting from this 
point by assuming a continuous surface [11]. However, if more samples 
are available, this can help to reduce the influence of an uncertain and 
noisy measurement of a single surface point. 
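
The ambiguity can be made tangible with a small numerical sketch. For a single camera ray and a single observed monitor point, every assumed distance along the ray yields a different normal, and each of them satisfies the law of reflection. The function name, the normalization of the normal to unit length, and the example geometry are assumptions made for illustration.

```python
import numpy as np

def normal_hypothesis(s_hat, p, rho):
    """Unit surface normal following eq. (3.2) for an assumed distance rho along the ray."""
    s = rho * s_hat                            # hypothetical surface point
    r_hat = (p - s) / np.linalg.norm(p - s)    # unit direction from s to the monitor point
    n = r_hat - s_hat                          # difference of reflected and camera ray
    return n / np.linalg.norm(n)               # normalization added for convenience

s_hat = np.array([0.0, 0.0, 1.0])   # camera ray along the optical axis
p = np.array([0.5, 0.0, 0.0])       # observed monitor point in camera coordinates

# One normal hypothesis per assumed distance: the one-parametric family of solutions.
normals = {rho: normal_hypothesis(s_hat, p, rho) for rho in (0.5, 1.0, 2.0)}
```

All three hypotheses explain the same observation equally well, which is why at least one additional piece of information, such as a known surface point, is needed to select the true normal.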


3.1.2 Related Works


Deflectometry has a long history in computer vision and optical metrol- 
ogy. Among the earliest work, Sanderson et al. [174] proposed a structured 
highlight illumination approach using an array of point light sources 
to illuminate a specular surface, and they estimated surface orientation 
using a stereo camera. The first promising results for optical metrology 
were demonstrated in the work of Petz and Ritter [151] and Petz and 
Tutsch [152, 153]. They proposed reflectance grating photogrammetry 
for the measurement of specular surfaces by using a linear position- 
ing unit to move a flat reference structure into different positions from 
which the illumination direction is derived. By applying the triangulation 
principle between the camera rays and known illumination direction, 
they then determine point-wise the absolute 3D object coordinates with 
high precision. Knauer et al. [103] analyzed the investigation of specular 
freeform surfaces through phase-shift coding and introduced the term 
phase measuring deflectometry for the first time. They further described 
many aspects, e.g., the measurement principle, the physical limits of the 
method, and the calibration of the system components. Bothe et al. [22] 
gave practical demonstrations of their fringe reflection technique, which 
allowed nondestructive testing of specular surfaces and high-resolution 
3D shape measurement. Thus, deflectometry was promoted as a
novel technique for the measurement of specular freeform surfaces. 


Applications 


Since for industrial applications often only quality assurance or defect 
detection is of interest, many pure inspection methods exist. Häusler et al. 
[74] proposed a microscopic PMD system with nanometer sensitivity for 
local surface features. Xiao et al. [228] used deflectometry to measure the 
3D shape of aspherical mirrors. Olesch et al. [142] used deflectometry for 
large-scale estimation of telescope mirrors. Häusler et al. [75] and Faber 
et al. [51] compared deflectometry with interferometry and described the
advantages and disadvantages. Werling and Beyerer [220] proposed in- 
verse patterns that are computed in advance for known test objects and 
that can be used for fast and robust defect detection on specular objects.
Su et al. [192] investigated deflectometry with an infrared source to
analyze rough optical surfaces. Höfer et al. [83] and Höfer [82] presented
approaches that allow coding of the reference patterns in the infrared 
spectrum, and thus render infrared deflectometry industrially useful. 


Regularization 


The deflectometric normal measurement is inherently ambiguous; thus,
additional information is needed for 3D reconstruction. Several
approaches for regularization exist.

Li et al. [113] use an additional confocal white-light distance sensor to 
precisely determine a single surface point from which the surface can be 
reconstructed. Huang et al. [87] use an external laser tracker to precisely 
measure the system setup and mirror surface position, and compare it 
with a virtual system setup, which subsequently yields a high precision 
reconstruction. 

When additional assumptions are made about the surface, the recon- 
struction can be simplified and a solution approximated. By neglecting 
higher-order surface properties, the reconstruction task can be reduced 
to a finite-dimensional parameter estimation problem, which in general 
has a unique solution [9]. Liu et al. [118] show that the surface can be
reconstructed uniquely under certain conditions if it is at least twice con- 
tinuously differentiable. Pak [143] adopts this approach and simplifies 
the mathematical description. Liang et al. [114] characterize the surface 
locally as a low-dimensional model and build their approach on the work 
of Savarese et al. [177]. Huang et al. [85] describe the surface using a global 
model, and they find the surface through parameter optimization. 

Various methods exist to reconstruct the direction of the illumination 
utilizing a multi-monitor approach. Here, the monitor can be moved
with a linear positioning unit [152, 153], or the approach can be implemented
without mechanical movements by using a beam splitter and
two separate monitors [120, 247]. In this context, Han et al. [71] present an 
idea that can reconstruct the surface even with an uncalibrated camera 
model and unknown monitor poses. Similarly, a directed illumination 
can also be realized with telecentric optics, which enables triangulating 
the surface [184, 237]. 

When attempting to apply classical stereo vision to specular surfaces, 
initially there is the difficulty that only virtual features can be captured
in the two camera images (cf. Sec. 7.1.2). However, specular stereo can be
achieved by correlating the normal vector fields induced by two measure- 
ments, which indirectly enables surface triangulation. For this purpose, 
starting from the ambiguous normal vector field, Bhat and Nayar [17] 
seek the simultaneous solution of two partial differential equations, one 
for each viewpoint. Bonfort and Sturm [21] use a correlation measure for 
point-wise reconstruction by voxel carving. Balzer et al. [10] extend the 
principle to measure large objects by using multi-view specular stereo. 

The limit case of the stereo approach is represented by specular flow. 
It is assumed that the movement between image configurations is so 
small that the correspondence between image and scene points is main- 
tained across two images. Roth and Black [172] combine diffuse and 
specular flow to reconstruct partially specular surfaces. Balzer [9] derives 
model equations of specular flow that can also describe nonlinear cam- 
era motions. Adato et al. [2] provide a solution for shape from specular 
flow, which makes it possible to reconstruct the surface by observing the 
specular flow induced by an unknown environment motion field. Pak 
[144] derives a simple relation between specular flow and the Gaussian 
curvature of specular surfaces. However, the method has a practical dis- 
advantage: no coded illumination can be used because the camera has to 
be moved continuously for specular flow. 


Reconstruction 


Although regularization provides a rough estimate of the surface, the
advantage of deflectometry is that it can determine the local slope of
the surface very precisely, i.e., the measurement of the surface normals 
is generally significantly more precise than the direct measurement of 
surface points. Based on the unambiguous surface normals obtained 
from the regularization, the surface can be reconstructed. 

Various works exist that describe the surface using a two-dimensional 
polynomial and convert the reconstruction into a parameter estimation 
problem [85]. For this, depending on the shape of the specular object, 
different surface models are used, e.g., Zernike polynomials, radial basis 
functions, or Forbes polynomials [50, 85, 166]. 

Other approaches consider the reconstruction problem as normal integration
or gradient integration. In principle, there are two concepts
for this. Local methods integrate the surface along predetermined paths.
Horbach and Dang [84] propagate the regularization information by 
region growing starting from at least one known surface point to inte- 
grate the normal field. Neighboring surface normals can be computed 
by assuming the continuous differentiability of the surface, which al- 
lows a local regularization. However, in doing so, they also propagate 
the measurement error along the path, leading to a global shape deviation [50].
Since the normal field is typically corrupted by noise, it is
seldom integrable, i.e., curl-free. As a result, the error of the
integration depends on the chosen path. For this reason, variational approaches
are often used as global methods, where only the integrable
part of the normal field is considered, and the integration task is formu- 
lated as an energy minimization problem [163]. Since the integration of 
surface normals also occurs in many other applications (e.g., photometry, 
profilometry), there exists much literature on the subject. Chang et al. 
[35] use level set methods to integrate a multi-view normal field and 
apply this to photometric stereo images. In the case of deflectometry, 
Balzer et al. [10] use multi-view regularization to obtain an initial surface 
estimate, and they refine this by integrating the normal field. This is done 
by iteratively solving a Poisson equation using finite-elements analysis, 
updating the reconstructed normals and the measured normal field con- 
currently. Quéau and Durou [162] explore edge-preserving integration 
of normal fields by examining different energy functionals for the recon- 
struction of discontinuous surfaces. Quéau et al. [164] present several 
total variation-like integration approaches where surface normals and 
depth estimates can be fused into one surface. Antensteiner et al. [8] com- 
pare different algorithms that fuse depth values with gradient estimates, 
with application to light field photometric stereo. 
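
To illustrate the idea of treating normal integration as a global energy minimization, the following is a minimal least-squares sketch. It is not the method of any of the cited works; the function name `integrate_gradients`, the grid spacing `h`, and the toy surface are assumptions made for illustration. The slopes p = dz/dx and q = dz/dy would be obtained from the normals as p = -n_x / n_z and q = -n_y / n_z.

```python
import numpy as np

def integrate_gradients(p, q, h=1.0):
    """Least-squares surface from slopes p = dz/dx (columns), q = dz/dy (rows)."""
    rows, cols = p.shape
    idx = np.arange(rows * cols).reshape(rows, cols)
    A_rows, b = [], []

    def add_equations(i0, i1, g):
        # One equation per neighboring pair: z[i1] - z[i0] = h * g
        for a, c, val in zip(i0.ravel(), i1.ravel(), g.ravel()):
            row = np.zeros(rows * cols)
            row[a], row[c] = -1.0, 1.0
            A_rows.append(row)
            b.append(h * val)

    # Horizontal neighbors use the mean slope of the two grid points.
    add_equations(idx[:, :-1], idx[:, 1:], 0.5 * (p[:, :-1] + p[:, 1:]))
    # Vertical neighbors are treated in the same way.
    add_equations(idx[:-1, :], idx[1:, :], 0.5 * (q[:-1, :] + q[1:, :]))

    # Minimizing the squared residual over all pairs is path-independent.
    z, *_ = np.linalg.lstsq(np.array(A_rows), np.array(b), rcond=None)
    z = z.reshape(rows, cols)
    return z - z.mean()   # the height is only defined up to a constant offset

# Toy example: a smooth quadratic surface on an 8 x 8 grid.
ys, xs = np.mgrid[0:8, 0:8].astype(float)
z_true = 0.1 * xs**2 + 0.05 * xs * ys
p_true = 0.2 * xs + 0.05 * ys
q_true = 0.05 * xs
z_rec = integrate_gradients(p_true, q_true)
print(np.max(np.abs(z_rec - (z_true - z_true.mean()))))
```

Because every neighboring pair contributes one equation and all equations are solved jointly, noise in individual slopes is averaged rather than accumulated along a path. The dense matrix is only practical for small grids; a sparse solver would be the natural choice for real data.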

Other related work relevant for deflectometry, e.g., the coding of the
structured illumination and the calibration of the measurement system,
is not discussed here but in the associated chapters, see Ch. 4 and Ch. 5.
For more details on deflectometry, its applications, further regularization
and reconstruction techniques, the reader is referred to the comprehensive
reviews in the literature [11, 86, 163, 222].




3.2 Light Fields 


The light propagating through space contains a variety of information. 
Within the field of geometrical optics, the theoretical background for 
a description of this propagation is provided by the plenoptic function 
that assigns a radiance value to the light rays present in a physical space. 
It assumes that the usual 3D space is traversed by light propagating in 
all directions, and the light may be blocked, attenuated, or scattered. 
To account for all possible variations of light, the plenoptic function 
takes a seven-dimensional description $P(x, \theta, \lambda, \tau): \mathbb{R}^7 \to \mathbb{R}$. Arbitrary
radiance values can be assigned at any location in space $x \in \mathbb{R}^3$, for any
possible directional angle $\theta \in \mathbb{R}^2$, any wavelength $\lambda$, and any time $\tau$.
While the plenoptic function is mainly of conceptual interest in this work,
it has recently found applications in the field of scene reconstruction and
novel view synthesis [129, 236].

In contrast, light fields have a more practical meaning, since they allow 
the description of imaging systems in which only the rays that reach 
the camera sensor are relevant. By introducing additional constraints, 
the light field can be derived from the plenoptic function [92]. If only 
single points in time are considered or if the light is integrated over the 
exposure time, the temporal dimension $\tau$ of the plenoptic function can
be omitted. The integration over the spectral sensitivity of the camera
pixels eliminates the spectral dimension $\lambda$ of the plenoptic function.
Thus, the light field is considered monochromatic. However, in this work,
color or more abstract coded information may be assigned to the rays,
although this will not be explicitly stated. The most important reduction
of dimensions is achieved by the so-called free space assumption [110]. 
In homogeneous media that are free of occluders, the radiance along a 
ray is constant. Hence, the spatial dependency of the plenoptic function 
can be reduced by one dimension. Moon and Spencer [132] called the 
resulting function photic field, while in the field of computer graphics it is 
titled 4D light field [110] or Lumigraph [64]. Formally, the 4D light field 
L(u,v, s,t) is defined as the radiance along light rays in an empty space, 
where the coordinates (u, v, s, t) correspond to a certain parametrization 
of the spatial and angular dependencies of the light field. The array of 
rays in a light field can be modeled in different ways. The most commonly
used parametrization is the two-plane parametrization, where a light
ray is uniquely described by its intersections with two parallel planes, with
angular coordinates (u, v), spatial coordinates (s, t), and distance
f between both planes, see figure 3.2. This parametrization cannot represent
all rays, for example, rays that run parallel to the two planes. The
advantage, however, is that its description is closely related to the
analytical geometry of perspective imaging in optical systems.

Figure 3.2 Two-plane parametrization of the 4D light field: A light ray is described by the
coordinates of intersection with two parallel planes.
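
The two-plane parametrization can be sketched in a few lines. The conventions used here are assumptions for illustration only: the u, v-plane is placed at z = 0, the s, t-plane at z = f, and both coordinate pairs are taken as absolute positions in their planes; the function names are hypothetical.

```python
import numpy as np

def two_plane_to_ray(u, v, s, t, f):
    """Convert two-plane coordinates into a ray (origin, unit direction)."""
    origin = np.array([u, v, 0.0])            # intersection with the u, v-plane
    direction = np.array([s - u, t - v, f])   # towards the intersection with the s, t-plane
    return origin, direction / np.linalg.norm(direction)

def ray_to_two_plane(origin, direction, f):
    """Intersect a ray with both planes; fails for rays parallel to the planes."""
    if np.isclose(direction[2], 0.0):
        raise ValueError("ray is parallel to the planes and has no two-plane coordinates")
    a = -origin[2] / direction[2]             # ray parameter at z = 0
    b = (f - origin[2]) / direction[2]        # ray parameter at z = f
    u, v = (origin + a * direction)[:2]
    s, t = (origin + b * direction)[:2]
    return u, v, s, t

origin, direction = two_plane_to_ray(0.1, -0.2, 0.5, 0.3, f=2.0)
print(ray_to_two_plane(origin, direction, f=2.0))
```

The explicit check for a vanishing z-component of the direction corresponds to the rays that this parametrization cannot represent.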

A simple way to visualize the two-plane parametrization of the light 
field L(u, v, s, t) is to imagine it as a discrete collection of many perspec- 
tive images of the s,t-plane, each of which is taken from a different 
observation position in the u, v-plane with a virtual camera. Hence, for 
each fixed angular coordinate $(u_0, v_0)$ a two-dimensional slice can be
extracted from the light field, which in the following is called a subaperture
image (SAI):

    $\mathrm{SAI}_{u_0, v_0}(s, t) = L(u_0, v_0, s, t)$, (3.3)

where each SAI resembles a conventional image. By fixing an angular
coordinate and the spatial coordinate whose axis is parallel to that
coordinate, a so-called epipolar plane image (EPI) is obtained:

    $\mathrm{EPI}_{u_0, s_0}(v, t) = L(u_0, v, s_0, t)$, (3.4)

    $\mathrm{EPI}_{v_0, t_0}(u, s) = L(u, v_0, s, t_0)$, (3.5)

where, depending on which coordinates are fixed, a horizontal or vertical
EPI is extracted. Figure 3.3 shows an example light field as an array of
virtual cameras that are slightly shifted against each other, as well as a
horizontal and a vertical EPI. Due to the change of perspective for each
angular coordinate, the EPIs show lines of different slopes, whose orientation
provides information about the depth of the observed scene points.

Figure 3.3 Interpretation of the light field as a camera array. Each SAI represents a “virtual”
camera that is slightly shifted with respect to the other cameras. The dashed lines indicate
the coordinates of the extracted EPIs.
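
For a sampled light field, extracting SAIs and EPIs as in (3.3) to (3.5) amounts to array slicing. The following sketch assumes a discrete light field stored as a NumPy array with axis order (u, v, s, t); the array sizes, the random content, and the function names are purely illustrative.

```python
import numpy as np

# Hypothetical discrete light field with axes ordered (u, v, s, t).
U, V, S, T = 5, 5, 32, 48
rng = np.random.default_rng(0)
light_field = rng.random((U, V, S, T))

def subaperture_image(lf, u0, v0):
    """SAI for a fixed angular coordinate (u0, v0), cf. Eq. (3.3)."""
    return lf[u0, v0, :, :]

def epi_v_t(lf, u0, s0):
    """EPI over (v, t) for fixed u0 and s0, cf. Eq. (3.4)."""
    return lf[u0, :, s0, :]

def epi_u_s(lf, v0, t0):
    """EPI over (u, s) for fixed v0 and t0, cf. Eq. (3.5)."""
    return lf[:, v0, :, t0]

sai = subaperture_image(light_field, 2, 2)
epi_a = epi_v_t(light_field, 2, 16)
epi_b = epi_u_s(light_field, 2, 24)
print(sai.shape, epi_a.shape, epi_b.shape)   # (32, 48) (5, 48) (5, 32)
```

Each SAI has the full spatial resolution, whereas each EPI pairs one angular axis with the spatial axis parallel to it.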


3.2.1 Light Field Acquisition


The easiest way to sample the continuous light field L(u, v, s, t) is to use 
a mechanical gantry to place a conventional camera at different positions 
in the u, v-plane and capture the scene [205]. Instead of a
time-sequential capture, the acquisition can also be parallelized in hardware
using multi-camera arrays, which yields high-resolution
light fields [233]. While camera arrays can be miniaturized and a configuration
of specialized camera modules can be assembled, building and
maintaining camera arrays is costly and cumbersome.

In contrast, single-shot light field cameras have been proposed that 
image the light field through a single main lens and encode the four- 
dimensional information onto a two-dimensional camera sensor. The 
most commonly used designs for such light field acquisition devices 
are microlens-based light field cameras. While the basic idea of such 
cameras was already described by Lippmann [116] as early as 1908, only 
modern computing power and advances in the fabrication of microscopic 
structures made commercialization possible. The first design of a light 
field camera was introduced much later in 1992 by Adelson and Wang [3], 
who called it a plenoptic camera. One of the first hand-held prototypes
was built by Ng et al. [137], which was then commercialized by Lytro Inc.
The camera’s layout is similar to that of a conventional camera with the 
essential difference that an array of microscopically sized lenses is placed 
in front of the sensor. By adding this microlens array (MLA), it becomes 
possible to capture a section of the 4D light field L(u, v, s, t) of a scene and
encode it onto the 2D sensor. Several designs exist.

When the distance between the MLA and the sensor corresponds to 
the focal length of the microlenses, the camera is an unfocused plenop- 
tic camera, see figure 3.4. The coordinates of the light field’s two-plane 
parametrization are represented here by the s, t- and u, v-coordinates, 
whereby s,t define the position of a microlens in front of the sensor, 
and thus, they encode the spatial dimension of the light field. Hence, 
they can be interpreted as macro pixels. The u, v-coordinates define the 
position within the microlens relative to its center and in this way, they 
implicitly provide information on where a light ray has passed through 
the main lens. They represent the angular information of the light field. 

Figure 3.4 Schematic representation of an unfocused plenoptic camera (components shown: sensor, MLA, main lens).

Since the microlenses are relatively small (their size is usually below
100 µm), the main lens is almost infinitely far away when compared to the
distance between the sensor and MLA. The rays entering the microlenses 
can therefore be assumed to be parallel. As a consequence, rays that 
are imaged onto the central pixel of the sensor region belonging to a 
microlens originate from the center of the main lens, whereas rays from
the edge of the main lens are projected onto pixels corresponding to
angular coordinates away from the microlens center. Consequently, each
u, v-coordinate samples only a sub-area of the camera’s aperture. Hence, 
each SAI shows a very high depth of field, due to the small opening. To 
avoid overlapping between different microlens images, the f-numbers of 
the main lens and the microlenses must be matched to each other [137]. 
In the unfocused design, the spatial resolution is defined by the number 
of microlenses in front of the sensor, whereas the angular resolution is 
defined by the number of pixels under each microlens. 
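
The relation between sensor pixels and light field coordinates in the unfocused design can be sketched as a simple rearrangement. This is a strongly idealized illustration and not the decoding pipeline of a real camera: it assumes square micro-images of n_u x n_v pixels that are perfectly aligned with the pixel grid, and it ignores rotation, hexagonal lens layouts, and vignetting. The function name `decode_unfocused` is hypothetical.

```python
import numpy as np

def decode_unfocused(raw, n_u, n_v):
    """Rearrange an idealized raw sensor image into L[u, v, s, t]."""
    rows, cols = raw.shape
    if rows % n_u or cols % n_v:
        raise ValueError("sensor size must be a multiple of the micro-image size")
    S, T = rows // n_u, cols // n_v          # number of microlenses = spatial resolution
    blocks = raw.reshape(S, n_u, T, n_v)     # split each axis into (microlens, pixel under it)
    return blocks.transpose(1, 3, 0, 2)      # reorder axes to (u, v, s, t)

raw = np.arange(12 * 20, dtype=float).reshape(12, 20)   # toy 12 x 20 pixel sensor
lf = decode_unfocused(raw, n_u=4, n_v=4)
print(lf.shape)   # (4, 4, 3, 5): 4 x 4 angular samples, 3 x 5 spatial samples
```

The total number of samples is fixed by the sensor, so a larger micro-image (higher angular resolution) directly reduces the number of microlenses (spatial resolution).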

Since the sensor’s resolution is fixed, the spatial resolution of the light 
field decreases with increasing angular resolution. Because of this trade- 
off, new camera designs were introduced that allow a light field to be 
captured with significantly higher spatial resolution than the traditional 
approach, enabling the rendering of high-resolution images that meet 
the expectations of modern photographers [122]. MLA-based light field 
cameras in the so-called focused design were first introduced by Lumsdaine 
and Georgiev [123], and then later commercialized by Raytrix GmbH.

Figure 3.5 Schematic representation of a focused plenoptic camera (components shown: sensor, MLA, main lens).

In this design, the distance between the MLA and the sensor differs from
the focal length of the microlenses, see figure 3.5. Hence, the microlenses 
do not sample the main lens’ aperture but a virtual image plane. The
relation between light field coordinates and the optical components of 
the camera is no longer as intuitive as it was before. With the focused 
design, the number of pixels under each microlens no longer corresponds 
directly to the angular resolution. Rather, the microlenses now show 
micro-images of the scene. Each microlens can therefore be interpreted as 
a tiny virtual camera, where depending on the position of the microlens, 
both the optical center of the virtual camera is shifted and a different small 
section of the scene is observed. The pixels underneath the microlens 
thus encode spatial information, while the microlens position contains 
both spatial and angular information, due to a slightly different view 
of the scene in each micro-image. In particular, there are even different 
configurations for focused plenoptic cameras. Because the micro-images 
only scan the virtual image plane onto which the main lens images 
the scene, the micro-images have a significantly reduced depth of field 
compared to the unfocused design. This led to the introduction of multi-focus
plenoptic cameras, which have microlenses with different focal
lengths [148]. In this way, a focused image can be constructed at any depth
of focus, and a wide range of digital refocusing can be achieved [61].

Apart from microlens-based light field cameras, a variety of more
exotic designs exists, e.g., cameras based on coded apertures [211], multi- 
spectral light field cameras [178, 179], or light field objectives that can turn 
any standard camera into a light field system by using kaleidoscope-like 
imaging optics [128]. 


3.2.2 Related Works 


While a conventional camera only captures the spatial information of a 
scene, a light field with the additional angular information can be used, 
for example, to change the perspective on the scene or to change the posi- 
tion of the observer [137]. Moreover, even after capturing a dynamically 
active scene, it becomes possible to shift the focus plane by rendering new 
images from the 4D light field data [135]. Due to the highly redundant 
information, light fields can be used for a variety of applications, e.g., de- 
noising and deblurring [4, 43], super-resolution [19], segmentation [219], 
material recognition [213], hyper-spectral imaging [181], structure from 
motion and visual odometry [94, 238], to name a few. 

A popular research area is light field-based disparity estimation, which 
can be used directly to estimate depth if the camera is calibrated. A 
variety of methods exist for this purpose [98]. Due to the multi-view 
property of the light field, well-established feature-based approaches 
comparable to stereo imaging can be used [77]. Approaches for depth 
from focus/defocus [199] or disparity from EPIs exist [212, 218]. Using
the EPIs, lines of constant intensity appear, and disparity estimation can 
be performed with a local orientation estimate [216] or by a local line 
fitting [244, 245]. More generally, the disparity information is represented 
by the two-dimensional slope of constant-intensity planes embedded in 
the 4D space. In recent years, deep learning approaches have become 
state of the art in disparity estimation, as they can provide more robust 
local slope estimation [76, 187], or even incorporate the full 4D light field 
information into the process [78, 124, 234]. 
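
As a rough illustration of how the slope of the EPI lines encodes disparity, the following sketch estimates a single slope from image gradients in the least-squares sense. It is a simplified stand-in for the local orientation estimates and line fitting methods cited above, not a reimplementation of them. The synthetic EPI model E(v, t) = g(t - d v), the sign convention of the disparity d, and the function name are assumptions.

```python
import numpy as np

def epi_disparity(epi):
    """Least-squares slope of the constant-intensity lines in an EPI E[v, t]."""
    d_v, d_t = np.gradient(epi)   # derivatives along the angular (rows) and spatial (columns) axis
    # Along a line t = t0 + d * v the intensity is constant, hence d_v + d * d_t = 0.
    return -np.sum(d_v * d_t) / np.sum(d_t**2)

# Synthetic EPI of a fronto-parallel textured plane with constant disparity.
v = np.arange(-4, 5)[:, None]
t = np.arange(64)[None, :]
true_disparity = 0.4
epi = np.sin(0.3 * (t - true_disparity * v))
estimate = epi_disparity(epi)
print(f"true: {true_disparity}, estimated: {estimate:.3f}")
```

The estimate deviates slightly from the true value because finite differences only approximate the derivatives. Practical methods evaluate such expressions in local windows to obtain a dense disparity map and a confidence measure.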

In the field of partially specular reflection (or partial transparency), 
the EPIs show a superposition of lines representing the direct depth 
of the partially specular surface and the indirect depth of the reflected 
scene, respectively. For these situations, many approaches exist to model 
and remove specular highlights [41], to estimate both depths simultaneously
[95, 232], or to separate both image layers to obtain two separate
light fields [95, 96, 193]. Light fields are also used to detect and classify 
non-Lambertian objects, such as refractive or transparent objects [126, 
224, 225]. Ideguchi et al. [91] estimated the surface of transparent objects 
based on local photo consistency. Lu et al. [121] used light fields to sample 
surface BRDFs and developed an architecture based on convolutional 
neural networks (CNN) for BRDF identification. Alperovich et al. [7]
used a deep encoder-decoder network that solves non-Lambertian intrin- 
sic light field decomposition, which can recover albedo, shading, and 
specularity. Light field cameras have found applications in the field of 
optical metrology as well. Ziwei et al. [252] use a light field camera as 
an additional geometric constraint to resolve the ambiguity of phase 
unwrapping, which serves as the basis for many optical metrology applications.
Liu et al. [119] achieve high dynamic range 3D imaging by
using a light field camera for multi-view fringe projection profilometry.
Zhou et al. [251] incorporate the light field’s EPI-based depth estimation to
improve profilometric reconstruction, and Farber et al. [58] demonstrate
that by using spectral light fields, an application like depth estimation or 
profilometry can be improved even more.


4 Deflectometric Registration


Deflectometry is used for high-precision surface measurement and dense
3D reconstruction of specular objects. In this context, it is necessary to
carry out an optical position encoding to be able to reconstruct the surface
by means of triangulation. As described in Sec. 3.1, the objective of the 
deflectometric registration is to determine an imaging function, which al- 
lows direct mapping of camera pixels to points in the monitor plane. With 
the help of this registration, local defects in the surface under test can 
be detected or the surface can be reconstructed globally, see Ch. 7. Apart 
from deflectometry, optical encoding techniques can also be used in the 
field of camera calibration, where reference features displayed by an ac- 
tive target drastically decrease the calibration error as compared to when 
standard checkerboard features are used, see Ch. 5. Hence, to ensure 
precise measurements, the registration must be as accurate as possible. 
In order to determine the imaging function, the positions on the refer- 
ence plane and thus the pixel coordinates of the monitor screen must be 
uniquely assigned to pixels in the camera employing an encoding process. 

There are a number of possibilities for such an encoding. In principle, it 
would be most straightforward to turn on each individual reference pixel 
one at a time and check which camera pixel is measuring an increase in 
intensity. However, this would take a considerable amount of time. It 
makes more sense to encode all reference pixels simultaneously using 
more advanced methods. A local encoding of the reference pixels can be 
done by displaying statistical patterns where each position within this 
pattern is identified by the local pixel neighborhood [173]. While this 
method enables very fast measurements, since only one pattern has to be 
displayed, it is only of limited use for the measurement of more complex 
scenes. Because the surface typically distorts the reference pattern, the 
encoding of the local neighborhood can often no longer be recognized. 
To achieve a high-accuracy measurement, a temporal encoding of each 
pixel is more suitable. Here, instead of a single pattern, a sequence of
patterns is now displayed by the reference. The sequence of intensity
values measured in the camera subsequently allows decoding the ref- 
erence pixels and yields the determination of the imaging function. A 
popular temporal coding method is the coding of the reference pixels 
by means of a Gray code [159]. Here, a binary pattern sequence is displayed
by the reference to uniquely code the individual pixels. However,
a major disadvantage of the Gray code method is that it uses only binary
intensity values. As a result, the displayed signal with its sharp edges 
has high-frequency components. Because most of the time the camera 
and the surface provide a blurred image of the reference pattern, these 
edges become blurred and the decoding becomes more difficult. Another 
disadvantage is that only discrete pixels can be encoded and no subpixel 
information can be extracted [159]. 
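
The Gray code principle can be sketched as follows for a single monitor axis. The function names are hypothetical, and the code assumes ideal, already binarized camera measurements, so the blur-related decoding problems mentioned above do not appear in it.

```python
import numpy as np

def gray_code_patterns(width):
    """Binary pattern sequence (n_bits x width) encoding each column index."""
    n_bits = max(1, (width - 1).bit_length())
    cols = np.arange(width)
    gray = cols ^ (cols >> 1)                            # reflected binary Gray code
    shifts = np.arange(n_bits - 1, -1, -1)[:, None]      # most significant bit first
    return ((gray[None, :] >> shifts) & 1).astype(np.uint8)

def decode_gray(bits):
    """Recover the column indices from a measured (binarized) bit sequence."""
    n_bits = bits.shape[0]
    weights = 1 << np.arange(n_bits - 1, -1, -1)
    gray = (bits.astype(np.int64) * weights[:, None]).sum(axis=0)
    binary = gray.copy()
    shift = gray >> 1
    while shift.any():                                   # Gray-to-binary conversion
        binary ^= shift
        shift >>= 1
    return binary

patterns = gray_code_patterns(1000)
decoded = decode_gray(patterns)
print(patterns.shape, np.array_equal(decoded, np.arange(1000)))
```

Neighboring columns differ in exactly one pattern, which limits the effect of a single misclassified bit at a pattern edge, but the decoded positions remain integer pixel indices without subpixel information.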

Because of these disadvantages, phase-shift coding methods have be- 
come widely accepted in structured illumination applications. Here, a 
sequence of sinusoidal signals is displayed by the reference, whereby the 
coding of the pixel coordinate is contained in the phase of the sinusoidal 
signal. The great advantage of these methods is that they are robust to a 
variation in the ambient illumination, to noise, to low-pass filtering due 
to a defocusing effect of the camera, and that they allow an estimation 
of the phase uncertainty [59]. At the same time, these methods enable a 
subpixel-accurate encoding if the reference pixels are slightly out of focus. 
To further increase the accuracy of the measurement, multi-frequency 
methods are used, where sinusoidal pattern sequences with different 
frequencies are displayed. While this increases the accuracy of the reg- 
istration, the periodicity of the sinusoidal pattern sequence leads to an 
ambiguous position encoding in the entire measurement range with 
just a single phase measurement. The uniqueness range of the phase 
measurement initially extends only over one period of the underlying 
sinusoidal pattern. This leads to a modulo-$2\pi$ phase wrapping, which
can only be compensated using phase unwrapping methods.
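
A short numerical sketch may help to illustrate the wrapping problem. The frequency and the positions are arbitrary example values, and the period index k is computed from the ground truth here only for demonstration; in a real measurement it is unknown and has to be provided by a phase unwrapping method.

```python
import numpy as np

f = 8                                            # periods across the normalized range [0, 1)
x = np.array([0.05, 0.05 + 1 / f, 0.05 + 3 / f]) # positions exactly one or more periods apart
phase = 2 * np.pi * f * x                        # absolute phase
wrapped = np.mod(phase, 2 * np.pi)               # what a single phase measurement provides

print(np.round(wrapped, 4))                      # identical values for all three positions

# With a known period index k, the position follows directly.
k = np.floor(f * x)
x_rec = (wrapped + 2 * np.pi * k) / (2 * np.pi * f)
print(np.round(x_rec, 4))
```

All three positions yield the same wrapped phase, so a single frequency f > 1 can only localize a point within one period of length 1/f.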

When only one phase measurement is available, spatial unwrapping 
methods must be used, which examine the local 2D neighborhood of the 
phase map and use spatial information to unwrap it. For applications
where several phase measurements can be performed, the so-called temporal
multi-frequency phase unwrapping methods have proven to be
the best choice, since they allow a pixel-individual unwrapping. These
temporal methods are generally categorized into four groups: hierar- 
chical methods [88-90, 102, 147, 201], heterodyne methods [37, 42, 107, 
150, 158, 169, 170, 203, 204, 214, 253], number-theoretical methods [45, 
46, 69, 160, 161, 190, 198, 202, 248], and distance minimization-based 
methods [54-57, 111, 149, 255]. They differ in the way the unwrapping is 
performed, in which frequency configurations can be used, and in how 
large the resulting uniqueness range of the unwrapping is. However, 
a disadvantage of the classical methods is that typically not all phase 
measurements are unwrapped at the same time. Moreover, they often do 
not take into account the inherent periodic structure of the phase, which 
leads to erroneous results. More importantly, the estimation of the phase 
uncertainty is completely neglected in the entire unwrapping procedure. 

To overcome these deficiencies, this chapter presents a probabilistic 
approach for phase unwrapping, which uses circular statistics to de- 
scribe the multi-frequency phase-shift coding to optimally reconstruct 
the phase. The presented approach respects the periodicity of the phase, 
implicitly unwraps all phase measurements simultaneously by finding 
the underlying optimal position encoding that caused the phase mea- 
surement using maximum-likelihood estimation, allows for an easy fre- 
quency selection with a maximum uniqueness range of the unwrapping, 
and additionally, includes the estimation of the phase uncertainty into 
the overall unwrapping process. Furthermore, in this chapter, it is pro- 
posed to not only perform a temporal unwrapping but to additionally 
incorporate the information of the local pixel neighborhood in the mod- 
eling and thus obtain a probabilistic approach for spatio-temporal phase 
unwrapping. 

The structure of this chapter is as follows: Sec. 4.1 discusses the general 
concept of phase-shift coding and shows how the phase and the phase 
uncertainty can be reconstructed from the sinusoidal pattern sequence. 
Sec. 4.2 introduces the principles of phase unwrapping. Sec. 4.3 describes 
how the state-of-the-art phase unwrapping algorithms can be optimized 
by slight modifications. Sec. 4.4 then presents the probabilistic approach
for phase unwrapping. Finally, in Sec. 4.5 the presented methods
are extensively analyzed and compared to the state of the art.


4.1 Phase-Shift Coding


In principle, to obtain an absolute coordinate, one could display a single 
sinusoidal signal or a linearly increasing intensity curve on the reference 
system and then assign an intensity value to each pixel. However, since 
commercially available monitor screens or projectors can only display 
a limited number of discrete intensity levels (usually only 8 bits), one 
would have to expect strong quantization errors. Furthermore, this sim- 
ple approach would be very vulnerable to external influences, such as 
a variation of the ambient illumination or attenuation of the signal’s 
amplitude. Therefore, it makes more sense not to use the signal intensity 
as an information carrier but rather the phase of a sinusoidal signal. 
The basic principle of phase-shift coding is to assign an individual
pair of phases $\varphi_x(x, y)$, $\varphi_y(x, y)$ of sinusoidal signals
to each reference pixel $(x, y)$:

\[ \boldsymbol{\varphi}(x, y) = \begin{pmatrix} \varphi_x(x, y) \\ \varphi_y(x, y) \end{pmatrix}, \tag{4.1} \]

where the coordinates of the pixels are interpreted as relative coordinates
$x, y \in [0, S)$ with $S = 1$ for the rest of this chapter.

Phase-shift coding must be performed independently in both the horizontal
and vertical direction, which is why only the encoding in the x
direction is considered in the following. The encoding in the y direction
is done analogously. Further, the argument of the phase is also simplified
by omitting the coordinate y, since the phase in x direction will take the
same value for each y. In other words, in the following
$\varphi(x) := \varphi_x(x, y)$ holds without loss of generality.

To encode a normalized monitor coordinate $x \in [0, 1)$, a signal
sequence of $M$ sinusoidal patterns with frequency $f$ and shifted by
$\psi_m$ is generated and displayed on a monitor screen, whereby the
coordinate is contained in the phase $\varphi(x) = 2\pi f x$ of the signal
sequence

\[ I_m(x) = \frac{I_{\max}}{2} \left( 1 + \cos(\varphi(x) + \psi_m) \right). \tag{4.2} \]
Here $I_{\max}$ represents the maximum displayable brightness value. The
type of phase-shift coding is determined by the choice of the discrete
phase shifts $\psi_m$ and can be influenced by the number and also the
values of the shifts, see [99, 194] for a comparison of possible methods. In
this work, only the most widely used class of phase-shift algorithms is
considered, the so-called symmetric M-step algorithms with equidistant
phase offsets

\[ \psi_m = \frac{2\pi m}{M}, \quad m \in [1, 2, \dots, M]. \tag{4.3} \]
The signal $I_m$ displayed on the reference illuminates the scene that is
to be examined and is then mapped onto the camera sensor. In the case
of deflectometry, the signal is emitted by a monitor screen, reflected at a
specular surface, and projected into the camera. When using phase-shift
coding to obtain reference features for camera calibration, the camera may
directly observe the monitor screen. Thus, regardless of the application,
a camera records a signal sequence for every camera pixel
$\mathbf{u} = (s, t)^T$

\[ I_m(\mathbf{u}) = A(\mathbf{u}) + B(\mathbf{u}) \cos(\varphi(\mathbf{u}) + \psi_m), \tag{4.4} \]

with $m = 1, \dots, M$. Here $A(\mathbf{u})$ is a constant background
illumination, $B(\mathbf{u})$ is the modulation of the signal and
$\varphi(\mathbf{u})$ is the phase that contains the information about the
encoded screen pixels $x(\mathbf{u})$.

Because each camera pixel $\mathbf{u}$ can be considered independently, the
coordinates $\mathbf{u}$ are omitted in the following for clarity.

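To determine the three unknown quantities $A$, $B$, $\varphi$ from the
recorded signal sequence, $M \ge 3$ phase shifts are needed and the
formulas for the solutions can then be derived [194]

\[ A = \frac{1}{M} \sum_{m=1}^{M} I_m, \tag{4.5} \]

\[ B = \frac{2}{M} \sqrt{ \left( \sum_{m=1}^{M} I_m \sin(\psi_m) \right)^2 + \left( \sum_{m=1}^{M} I_m \cos(\psi_m) \right)^2 }, \tag{4.6} \]

\[ \varphi = \operatorname{arctan2}\left( -\sum_{m=1}^{M} I_m \sin(\psi_m), \; \sum_{m=1}^{M} I_m \cos(\psi_m) \right), \tag{4.7} \]

where $\operatorname{arctan2}(a, b) \in [-\pi, \pi)$ is used, which correctly
assigns the arguments of the arctangent to the four quadrants. Also, for
the sake of simplicity, in the remainder of this chapter the domain of the
phase is shifted to positive values:

\[ \varphi := \varphi \bmod 2\pi \in [0, 2\pi). \tag{4.8} \]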
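
The decoding in (4.5) to (4.8) can be illustrated with a minimal Python sketch for a single camera pixel. It assumes the equidistant shifts of (4.3); the function name `decode_phase_shift` and the synthetic test values are hypothetical and only serve the illustration.

```python
import math

def decode_phase_shift(intensities):
    """Recover offset A, modulation B and wrapped phase from M >= 3 samples.

    Assumes equidistant shifts psi_m = 2*pi*m/M for m = 1..M, as in (4.3).
    """
    M = len(intensities)
    if M < 3:
        raise ValueError("at least three phase shifts are required")
    psi = [2.0 * math.pi * m / M for m in range(1, M + 1)]
    s = sum(I * math.sin(p) for I, p in zip(intensities, psi))
    c = sum(I * math.cos(p) for I, p in zip(intensities, psi))
    A = sum(intensities) / M                    # offset, (4.5)
    B = 2.0 / M * math.hypot(s, c)              # modulation, (4.6)
    phi = math.atan2(-s, c) % (2.0 * math.pi)   # phase, (4.7) shifted as in (4.8)
    return A, B, phi

# Synthetic pixel following the camera model (4.4) without noise.
M = 4
A_true, B_true, phi_true = 100.0, 50.0, 1.2
samples = [A_true + B_true * math.cos(phi_true + 2.0 * math.pi * m / M)
           for m in range(1, M + 1)]
A_est, B_est, phi_est = decode_phase_shift(samples)
```

In this noise-free setting the three estimates reproduce the values used to synthesize the samples, which also shows that a change of the offset or the modulation leaves the recovered phase untouched.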
Figure 4.1 Top: Displayed cosine pattern with $\psi = 0$. Bottom: Corresponding phase
maps. The phase is wrapped for $f > 1$.


From equations (4.5), (4.6), (4.7) it becomes clear that the encoding 
of the phase is robust to many external influences. A locally variable 
ambient illumination would only affect the offset A. Attenuation of the 
signal amplitude by, for example, a dark surface would only reduce the 
contrast of the signal, resulting in a smaller modulation $B$. The
important information, the phase $\varphi$, however, remains in principle
completely unaffected by this. Furthermore, since the sinusoidal pattern
sequence consists only of single signal components with the frequency $f$
but has no higher frequency components, the method is also very robust
against low-pass filtering caused by blurring. It can be shown that only
the modulation $B$ is reduced, whereas the phase $\varphi$ remains
unaffected. In fact, it is even advantageous to image the pattern sequence
slightly out of focus, as this blurs the individual pixels of the reference
pattern and allows subpixel accuracy to be achieved in the encoding [182].


4.1.1 Phase Uncertainty 


The accuracy of the phase measurement is influenced by external system- 
atic influences of the entire measurement setup as well as by stochastic 
errors. For example, the nonlinearity of the intensity characteristic of the 
reference system can degrade the phase measurement. This however can 
be easily compensated using gamma calibration procedures or by using 
phase-shift coding with more shifts [117, 241]. Therefore, it will not be
a subject of further consideration in this work. Other external system- 
atic influences may change the brightness and contrast of the pattern 
sequence, which can lead to an increase in uncertainty. For example, the 
camera optics can image the sinusoidal patterns out of focus, which leads 
to a decrease in contrast. As the surface is usually part of the structured 
illumination system, the shape, roughness, and color of the surface also 
influence the quality of the estimation. Due to these system-related influ- 
ences, the uncertainty of the phase estimation can be different for each 
pixel. Furthermore, the phase measurement is influenced by stochastic 
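errors. Every camera image is accompanied by image noise. It is obvious
that this noise also affects the phase estimation and influences the
uncertainty of the measurement. In general, the sensor noise shows up as
noise in the pixel values and can be regarded, to a good approximation,
as normally distributed noise with variance $\sigma_I^2$ and zero mean [226].

Li et al. [112] show that the phase noise $\varepsilon_\varphi$ can be
calculated through Gaussian error propagation from the noise
$\varepsilon_{I_m}$ of the images of the pattern series:

\[ \varepsilon_\varphi = \sum_{m=1}^{M} \frac{\partial \varphi}{\partial I_m} \varepsilon_{I_m} = -\sum_{m=1}^{M} \frac{2 \sin(\varphi + \psi_m)}{M B} \varepsilon_{I_m}. \tag{4.9} \]

Further, for symmetrical M-step methods, the phase noise has zero mean
and its uncertainty, i.e., the standard deviation of $\varepsilon_\varphi$,
can be specified:

\[ \sigma_\varphi^2 = \frac{4 \sigma_I^2}{M^2 B^2} \sum_{m=1}^{M} \sin^2(\varphi + \psi_m) = \frac{2 \sigma_I^2}{M B^2} \quad \Rightarrow \quad \sigma_\varphi = \sqrt{\frac{2}{M}} \frac{\sigma_I}{B}. \tag{4.10} \]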


While $B$ and $M$ can be estimated or are directly defined by the
phase-shift coding, the sensor noise is initially unknown. To be able to describe
the phase noise absolutely, Fischer et al. [59] introduced a quantitative 
noise model, which combines the phase noise with the parameters of the 
EMVA 1288 standard for camera systems [226]. This makes it possible 
to predict the phase uncertainty very precisely by calculating only the 
modulation B from the pattern sequence. 
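
As a small numerical check of (4.10), the following Python sketch compares the closed-form phase uncertainty with an explicit Gaussian error propagation over the individual images. It assumes equidistant shifts and identical, independent noise with standard deviation `sigma_I` in every image; the function names and the numbers are hypothetical.

```python
import math

def phase_std(sigma_I, B, M):
    """Closed form of (4.10): sigma_phi = sqrt(2/M) * sigma_I / B."""
    return math.sqrt(2.0 / M) * sigma_I / B

def phase_std_propagated(sigma_I, B, M, phi):
    """Same quantity from the per-image sensitivities in (4.9).

    The sign of the sensitivities is irrelevant here because they are squared.
    """
    psi = [2.0 * math.pi * m / M for m in range(1, M + 1)]
    gains = [2.0 * math.sin(phi + p) / (M * B) for p in psi]
    return sigma_I * math.sqrt(sum(g * g for g in gains))

sigma_closed = phase_std(sigma_I=2.0, B=50.0, M=4)
sigma_prop = phase_std_propagated(sigma_I=2.0, B=50.0, M=4, phi=0.7)
```

Both values agree independently of the phase, and halving the modulation `B` doubles the uncertainty, which is why the modulation is a useful per-pixel quality measure.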

To further reduce the uncertainty, it is useful to use sinusoidal pattern 
sequences with a frequency f > 1. This has two beneficial effects. The 
first effect is a reduction in the quantization noise. Because conventional
monitors or projectors can be operated with 256 intensity levels, sinu- 
soidal patterns with small frequencies show considerable steps in the 
signal (e.g., 1920 pixels of a monitor screen cannot be described without 
ambiguities when only 256 intensity values are used). Increasing the fre- 
quency provides locally a higher dynamic in the pattern. This attenuates 
the influence of quantization errors [59]. The second and more important 
effect is a reduction of the phase noise induced by the camera sensor 
noise. As explained in more detail in the next section, phase jumps occur
in the reconstructed phase when the frequency of the sinusoidal pattern
sequence is chosen to be $f > 1$. The phase would take values
$\varphi > 2\pi$ but is only defined on the periodic interval $[0, 2\pi)$.
Thus, the real line $\mathbb{R}$ is wrapped to the smaller interval
$[0, 2\pi)$, see figure 4.1. To unwrap the phase again, an integer multiple
of $2\pi$ must be added at corresponding places, see Sec. 4.2. The
unwrapped phase finally results in

\[ \Phi_f = \varphi + 2\pi k + \varepsilon_\varphi, \tag{4.11} \]

where $\varphi \in [0, 2\pi)$ represents the wrapped phase, $k \in \mathbb{Z}$
is the unwrapping factor and $\varepsilon_\varphi \in [0, 2\pi)$ represents
the phase noise with uncertainty $\sigma_\varphi$. Since the domain of the
unwrapped phase has been increased to $\Phi_f \in [0, 2\pi f)$, it has to be
scaled back to the original range. The final phase measurement therefore
results in

\[ \Phi = \frac{\Phi_f}{f} = \frac{\varphi + 2\pi k}{f} + \frac{\varepsilon_\varphi}{f} \tag{4.12} \]

with $\Phi \in [0, 2\pi)$. By increasing the frequency and then scaling back,
the phase information is not changed, but the noise is reduced by the factor
$1/f$. The uncertainty of the unwrapped phase is then given by

\[ \sigma_\Phi = \frac{1}{f} \sigma_\varphi = \frac{1}{f} \sqrt{\frac{2}{M}} \frac{\sigma_I}{B}. \tag{4.13} \]

In summary, with the phase-shift coding one obtains not only a pure
position encoding but additionally also the associated uncertainty, where
the complete information is encoded in the phase $\varphi$ and the phase
uncertainty $\sigma_\varphi$.
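
The noise reduction by the factor $1/f$ in (4.12) can be made concrete with a short Python sketch. It assumes that the correct unwrapping factor is known, which is exactly what the following sections have to provide; the helper names and the chosen numbers are hypothetical.

```python
import math

TWO_PI = 2.0 * math.pi

def wrap_measurement(x, f, eps=0.0):
    """Wrapped phase and true unwrapping factor for a coordinate x in [0, 1).

    eps is an additive phase error, standing in for the phase noise in (4.11).
    """
    total = TWO_PI * f * x + eps
    k = math.floor(total / TWO_PI)
    return total - TWO_PI * k, k

def normalized_phase(phi, k, f):
    """Unwrap and scale back to the original range, as in (4.12)."""
    return (phi + TWO_PI * k) / f

x, f, eps = 0.3, 8, 0.05
phi_clean, k_clean = wrap_measurement(x, f)
phi_noisy, k_noisy = wrap_measurement(x, f, eps)
Phi_clean = normalized_phase(phi_clean, k_clean, f)
Phi_noisy = normalized_phase(phi_noisy, k_noisy, f)
error_after_scaling = Phi_noisy - Phi_clean  # equals eps / f
```

A phase error of 0.05 rad at frequency 8 thus shrinks to 0.05 / 8 rad in the normalized phase, while the encoded coordinate itself is unchanged.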
4.2 Principles of Phase Unwrapping


If the frequency of the phase-shift pattern sequence is chosen as $f > 1$,
jumps occur in the reconstructed phase. These jumps appear whenever
the phase would exceed the value $2\pi$ but is mapped back to the interval
$[0, 2\pi)$ by the arctangent (4.7). Resolving these jumps is the goal of
phase unwrapping. For this purpose, an integer multiple $k$ of $2\pi$ is
added to the wrapped phase; $k$ is often called the period-order number or
unwrapping factor. Since the wrapping of the phase strongly depends
on the chosen frequency, the optimal choice of the unwrapping factor is
also frequency-dependent. For a coordinate $x$ and frequency $f_i$, (4.11)
can be rewritten:

\[ \Phi_i(x) = \varphi_i(x) + 2\pi k_i(x), \quad k_i \in [0, 1, \dots, \lceil f_i \rceil - 1]. \tag{4.14} \]


The task of phase unwrapping is to find the correct $k_i$ for each phase
measurement. Since an individual unwrapping factor exists for each 
pixel, the problem from (4.14) is initially under-determined. To get a 
solution anyway, additional information has to be used. In principle, 
there are two approaches to solve the problem: spatial and temporal 
phase unwrapping. 

Spatial phase unwrapping algorithms are useful when it cannot be 
guaranteed that the phase remains constant over time or when repeated 
measurements would be too costly. With spatial algorithms, phase un- 
wrapping is performed using only a single phase measurement. The infor- 
mation necessary for the unwrapping is then obtained from the 2D pixel 
neighborhood. For example, in region growing-based approaches, start- 
ing from an initial pixel, the phase is unwrapped aiming to achieve a con- 
tinuous phase profile where neighboring pixels have a similar value [183, 
243, 250]. However, spatial unwrapping is very susceptible to noise, and 
phase discontinuities can make the unwrapping difficult or cause errors.
For example, a step in the phase cannot be reconstructed without
ambiguity, since the algorithm is unable to determine the step's height,
which may have a multiple of $2\pi$ as an offset. The main disadvantage
of spatial unwrapping methods is that they can generally only obtain 
a relative phase instead of an absolute one, which is not useful for 3D 
reconstruction problems. Hence, if the requirements for spatial phase 
unwrapping are not satisfied or an absolute phase estimate is needed,
temporal phase unwrapping must be used. 

While this work is focused on phase-shift coding, these phase wrap- 
ping effects also appear in other fields of optical metrology, e.g., inter- 
ferometry [37, 42, 158], SAR imaging [40, 155], or even time-of-flight 
imaging [47, 48]. Thus, the phase unwrapping problem influences many 
other applications. 


4.2.1 Temporal Phase Unwrapping 


Temporal phase unwrapping methods in general do not use the spatial 
information in the phase map. They can therefore handle each pixel indi- 
vidually, which means that discontinuities in the phase do not cause any 
problems. On the other hand, they rely on additional information ob- 
tained by additional measurements. This can be achieved, for example, by 
recording additional image patterns that can be decoded unambiguously. 
Methods based on temporal Gray coding achieve unambiguous coding
and can be used as a basis for phase unwrapping [175, 240]. However, 
they cannot achieve sub-pixel accuracy and are susceptible to noise and 
defocusing effects [254]. An encoding using statistical patterns allows spa- 
tial decoding, which can be used directly for phase unwrapping. While 
these methods allow for a fast acquisition time, the evaluation of statis- 
tical patterns has similar drawbacks as the spatial phase unwrapping 
algorithms. For an overview of absolute phase unwrapping methods, the 
reader is referred to the literature [242, 254]. 

This work is focused on another class of unwrapping methods: temporal
multi-frequency phase unwrapping. These methods use multiple
phase-shift pattern sequences with different frequencies $f_i$ to obtain
multiple phase measurements $\varphi_i \in [0, 2\pi)$, all of which are
based on the same coordinate encoding. Depending on the frequencies, the
phase measurements are wrapped differently. Since it is assumed that the
unwrapped phase does not change over time, the multiple phase measurements
generate a system of equations, where each equation has the form of (4.14).
Because all phase measurements are based on the same coordinate $x$, if
certain requirements are met, the equation system has a unique solution

\[ x = \frac{\Phi_i}{2\pi f_i} = \frac{\varphi_i}{2\pi f_i} + \frac{k_i}{f_i}. \tag{4.15} \]

The unwrapping of the phase measurements is then obtained by solving
this equation system, for which various methods exist.


4.2.1.1 Hierarchical Unwrapping 


The hierarchical methods are among the most intuitive approaches. They 
use a series of phase measurements in which the frequency of the un- 
derlying sinusoidal signals is increased in each step. To obtain an un- 
ambiguous unwrapping of all phase measurements, the frequency of 
the first measurement is chosen in such a way that the measured phase
is not subject to ambiguities. Thus, $f_0 = 1$ and $\Phi_0 = \varphi_0$.
Each subsequent measurement is then unwrapped using the previous unwrapped
phase associated with the lower frequency as a reference,
$\Phi_{\mathrm{ref}} = \Phi_{i-1}$, $f_{\mathrm{ref}} = f_{i-1}$. The
unwrapping factor can hereby be determined using a simple rounding
operation

\[ k_i = \operatorname{round}\left( \frac{\frac{f_i}{f_{\mathrm{ref}}} \Phi_{\mathrm{ref}} - \varphi_i}{2\pi} \right), \tag{4.16} \]

and the respective phase is unwrapped with

\[ \Phi_i = \varphi_i + 2\pi k_i. \tag{4.17} \]


There are many variations of hierarchical unwrapping algorithms in 
the literature, which differ mainly in the choice of the frequency sequence, 
e.g., linearly increasing frequencies [88, 90], exponentially increasing 
frequencies [89, 147], reversed sequences [90, 102] or generalized ap- 
proaches [201]. Usually, after unwrapping the individual phase maps, 
the phase corresponding to the highest frequency is used or all phase 
maps are averaged. 
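
A minimal Python sketch of the standard hierarchical scheme for one pixel is given below. It follows (4.16) and (4.17) under the assumption that the frequencies are sorted in ascending order and start at 1; the function name and the example frequencies are hypothetical, and the sketch omits any noise handling.

```python
import math

TWO_PI = 2.0 * math.pi

def hierarchical_unwrap(wrapped, freqs):
    """Unwrap each phase with the previously unwrapped, lower-frequency phase.

    wrapped[i] in [0, 2*pi) was measured at freqs[i]; freqs[0] must be 1.
    Returns the unwrapped phases Phi_i = phi_i + 2*pi*k_i of (4.17).
    """
    if freqs[0] != 1:
        raise ValueError("the first frequency must be unambiguous (f = 1)")
    unwrapped = [wrapped[0]]
    for i in range(1, len(freqs)):
        ref, f_ref = unwrapped[i - 1], freqs[i - 1]
        # Scale the reference to the current frequency and round, (4.16).
        k = round((freqs[i] / f_ref * ref - wrapped[i]) / TWO_PI)
        unwrapped.append(wrapped[i] + TWO_PI * k)
    return unwrapped

x_true = 0.37
freqs = [1, 4, 16]
wrapped = [(TWO_PI * f * x_true) % TWO_PI for f in freqs]
unwrapped = hierarchical_unwrap(wrapped, freqs)
# Decode the coordinate from the highest frequency, see (4.15).
x_est = unwrapped[-1] / (TWO_PI * freqs[-1])
```

Here only the phase of the highest frequency is used for the final coordinate, which corresponds to the first of the two fusion options mentioned above.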
4.2.1.2 Heterodyne Unwrapping


The two-wavelength heterodyne methods were originally developed 
for interferometry [37, 42, 158] but are also applicable to phase-shifting 
3D-measurement systems [169, 170, 253]. Unlike before, the heterodyne 
method can be implemented directly for high frequencies. Usually, only 
two frequencies $f_1$ and $f_2$ are used. The phase measurements associated
with the two frequencies are subtracted

\[ \varphi_{12} = (\varphi_1 - \varphi_2) \bmod 2\pi, \tag{4.18} \]

and the frequency of the synthetic phase $\varphi_{12}$ is then given by

\[ f_{12} = |f_1 - f_2|, \tag{4.19} \]

where $f_{12}$ represents the beat frequency. If $f_1$ and $f_2$ are
well-chosen, the uniqueness range of the phase unwrapping can be increased
enough to resolve the ambiguity [204]. With the normalized reference size
$S = 1$ that is used in this chapter, it can be shown that
$f_{12} = |f_1 - f_2| \le 1$ must hold in order to allow an unambiguous
phase reconstruction.

Since the phase noise of $\varphi_1$ and $\varphi_2$ is accumulated during
the formation of the synthetic phase, the signal-to-noise ratio
deteriorates. For this reason, the synthetic phase is generally used only
to unwrap the underlying measurements $\varphi_1$ and $\varphi_2$. The
unwrapping factors $k_1$ and $k_2$ are hereby calculated using (4.16) with
$f_{\mathrm{ref}} = f_{12}$, $\Phi_{\mathrm{ref}} = \varphi_{12}$.

The extension to more than two frequencies is described in [107, 214] 
and allows increasing the unambiguous measurement range even fur- 
ther. For this, several approaches exist that optimize the choice of the 
frequencies to obtain a robust unwrapping result [150, 203, 204]. 
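
The two-frequency heterodyne scheme can be sketched in Python as follows. The sketch assumes $f_1 > f_2$ so that the synthetic phase in (4.18) increases with the coordinate, and it uses the rounding of (4.16) with the beat frequency as reference; the function name and the frequencies 8 and 7 are hypothetical choices.

```python
import math

TWO_PI = 2.0 * math.pi

def heterodyne_unwrap(phi1, phi2, f1, f2):
    """Unwrap two wrapped phases with their synthetic (beat) phase.

    Assumes f1 > f2 and a beat frequency f12 = f1 - f2 of at most 1.
    Returns the unwrapping factors (k1, k2) and the unwrapped phases.
    """
    f12 = f1 - f2
    if not 0 < f12 <= 1:
        raise ValueError("beat frequency must lie in (0, 1]")
    phi12 = (phi1 - phi2) % TWO_PI                 # synthetic phase, (4.18)
    k1 = round((f1 / f12 * phi12 - phi1) / TWO_PI)  # (4.16) with f_ref = f12
    k2 = round((f2 / f12 * phi12 - phi2) / TWO_PI)
    return (k1, k2), (phi1 + TWO_PI * k1, phi2 + TWO_PI * k2)

x_true, f1, f2 = 0.63, 8, 7
phi1 = (TWO_PI * f1 * x_true) % TWO_PI
phi2 = (TWO_PI * f2 * x_true) % TWO_PI
(k1, k2), (Phi1, Phi2) = heterodyne_unwrap(phi1, phi2, f1, f2)
x_est = Phi1 / (TWO_PI * f1)
```

The synthetic phase is used only to determine the unwrapping factors; the coordinate is decoded from the less noisy high-frequency phase.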


4.2.1.3 Number-Theoretical Unwrapping 


The number-theoretical unwrapping methods are based on number the- 
ory, relative primes, and the divisibility properties of integers. They were 
originally proposed by Gushov and Solodkin [69]. They were then further 
improved to reduce the susceptibility to phase errors [160, 198, 202, 249]. 
In its basic form, the method uses the Chinese remainder theorem to
calculate a simultaneous solution to the unwrapping problem. Following
the theorem, a system of simultaneous congruences

\[ X \equiv b_i \pmod{m_i}, \quad \text{for } i = 1, \dots, n \tag{4.20} \]

has a unique solution $X \in \mathbb{Z}$ if $b_i \in \mathbb{Z}$ and
$m_i \in \mathbb{Z}$ are known integers, where the $m_i$ are pairwise
co-prime numbers, i.e., their greatest common divisor satisfies
$\gcd(m_i, m_j) = 1, \; \forall i \ne j$. The solution itself is
then given by

\[ X = \sum_{i=1}^{n} M_i M_i' b_i \pmod{m}, \tag{4.21} \]

where $m, M_i, M_i' \in \mathbb{Z}$ with

\[ m = \prod_{i=1}^{n} m_i, \quad M_i = \frac{m}{m_i}, \quad M_i M_i' \equiv 1 \pmod{m_i}, \tag{4.22} \]

and where the $M_i'$ can be found using, e.g., the extended Euclidean
algorithm [39]. The theorem can be applied to the phase unwrapping
problem by comparing (4.15) to (4.20) and substituting

\[ X := \frac{\Phi_i S}{2\pi f_i}, \quad b_i := \frac{\varphi_i S}{2\pi f_i}, \quad m_i := \frac{S}{f_i}. \tag{4.23} \]

If the condition $\operatorname{lcm}(m_1, m_2, \dots) \ge S$ for the least
common multiple is fulfilled, the phase ambiguity can then be resolved [115].
Hereby, an appropriate scaling factor $S$ needs to be chosen to obtain
meaningful integer values and co-primes $m_i$. In the case of a
deflectometry application, it can be set to the size of the monitor screen
measured in pixels.
Further improvements to the algorithm can be achieved by precalculating 
a look-up table to speed up the computation time [45, 46, 161, 190, 248]. 
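
The basic number-theoretical scheme can be sketched in Python with the Chinese remainder theorem of (4.20) to (4.22) and the substitution (4.23). The sketch assumes noise-free phases, a hypothetical screen width of S = 1000 pixels and the hypothetical co-prime wavelengths 7, 11 and 13 pixels; all function names are illustrative.

```python
import math

TWO_PI = 2.0 * math.pi

def extended_gcd(a, b):
    """Return (g, s, t) with g = gcd(a, b) = s*a + t*b."""
    old_r, r = a, b
    old_s, s = 1, 0
    old_t, t = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_s, s = s, old_s - q * s
        old_t, t = t, old_t - q * t
    return old_r, old_s, old_t

def chinese_remainder(remainders, moduli):
    """Solve X = b_i (mod m_i) for pairwise co-prime m_i, see (4.21), (4.22)."""
    m = math.prod(moduli)
    X = 0
    for b_i, m_i in zip(remainders, moduli):
        M_i = m // m_i
        g, inverse, _ = extended_gcd(M_i, m_i)
        if g != 1:
            raise ValueError("moduli must be pairwise co-prime")
        X += M_i * (inverse % m_i) * b_i
    return X % m

def phase_to_remainder(phi, m_i):
    """b_i of (4.23), using S / f_i = m_i and rounding to the integer grid."""
    return round(phi / TWO_PI * m_i) % m_i

S = 1000                      # hypothetical screen width in pixels
moduli = [7, 11, 13]          # wavelengths m_i = S / f_i, product 1001 >= S
X_true = 640                  # encoded pixel position
phases = [(TWO_PI * X_true / m_i) % TWO_PI for m_i in moduli]
remainders = [phase_to_remainder(p, m_i) for p, m_i in zip(phases, moduli)]
X_est = chinese_remainder(remainders, moduli)
```

The rounding in `phase_to_remainder` is the step that makes this family of methods sensitive to phase noise: a single remainder that is off by one changes the reconstructed position by a large amount.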


4.2.1.4 Distance Minimization-Based Unwrapping 


The previous methods have relatively high restrictions on the choice of 
frequencies. Thus, newer approaches try to circumvent these restrictions 
by posing the phase unwrapping as an optimization problem. Pribanić
et al. [161] extend the two-wavelength number-theoretical method by
removing the restriction of having co-prime wavelengths. From the
combination of all possible unwrapping factors, they search for the one that
minimizes the distance between the two respective unwrapped phases.

The excess fraction methods can be regarded as a multi-wavelength
extension of the heterodyne methods [54-57]. They define an excess
fraction as the difference between an ideal continuous unwrapping factor
and its integer counterpart. The unwrapping factors are then determined
individually by minimizing the respective excess fraction, where each 
excess fraction is influenced by all phase measurements. 

More recent approaches try to perform the unwrapping of all phase
measurements simultaneously to find an ideal solution for all unwrapping
factors at the same time. For this purpose, the vector of ideal
unwrapping factors $\mathbf{k} = (k_1, k_2, \dots)$ is sought that minimizes
the distance of the individual unwrapped phases to the mean value of all
unwrapped phase measurements. Here, the distance measure can be defined by
an orthogonal projection of the wrapped phases onto a subspace [149], or
it can be written down directly as a sum of distances between the
unwrapped phases and the averaged unwrapped phase [111, 255]. It is hence
titled projection distance minimization (PDM).

With $\boldsymbol{\Phi} = (\Phi_1, \Phi_2, \dots)^T$,
$\Phi_i = \varphi_i + 2\pi k_i$, $\mathbf{f} = (f_1, f_2, \dots)^T$ and by
minimizing the projection distance

\[ \hat{\mathbf{k}} = \arg\min_{\mathbf{k}} \| \boldsymbol{\Phi} - \mathbf{P} \boldsymbol{\Phi} \|^2, \quad \text{with } \mathbf{P} = \frac{\mathbf{f} \mathbf{f}^T}{\mathbf{f}^T \mathbf{f}}, \tag{4.24} \]

the unwrapping factors, and thus, the simultaneous unwrapping of all
phase measurements can be obtained. Here, $\mathbf{P} \boldsymbol{\Phi}$
represents the projection of unwrapped phase measurements, which for the
ideal choice of $\mathbf{k}$ should be equal to $\boldsymbol{\Phi}$. The
optimal unwrapping factors are thereby found by an exhaustive search over
all possible combinations. To speed up the optimization, Petković et al.
[149] suggest ignoring impossible combinations and Zuo et al. [255] use the
geometry of the measurement setup of a profilometry system to further
exclude unreasonable combinations.
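
The exhaustive search behind (4.24) can be sketched in Python for a single pixel. The sketch assumes integer frequencies and noise-free phases and does not include any of the pruning strategies mentioned above; the function name and the frequencies 3, 4 and 5 are hypothetical.

```python
import itertools
import math

TWO_PI = 2.0 * math.pi

def pdm_unwrap(wrapped, freqs):
    """Brute-force projection distance minimization, see (4.24).

    Tests every combination of unwrapping factors k_i in 0..ceil(f_i)-1 and
    keeps the one whose unwrapped phase vector is closest to its projection
    onto the direction of the frequency vector.
    """
    ff = sum(f * f for f in freqs)
    best_dist, best_k, best_x = math.inf, None, None
    ranges = [range(math.ceil(f)) for f in freqs]
    for k in itertools.product(*ranges):
        Phi = [p + TWO_PI * ki for p, ki in zip(wrapped, k)]
        # f^T Phi / f^T f: common phase per unit frequency, equals 2*pi*x.
        scale = sum(f * P for f, P in zip(freqs, Phi)) / ff
        dist = sum((P - f * scale) ** 2 for f, P in zip(freqs, Phi))
        if dist < best_dist:
            best_dist, best_k, best_x = dist, k, scale / TWO_PI
    return best_k, best_x

x_true = 0.37
freqs = [3, 4, 5]
wrapped = [(TWO_PI * f * x_true) % TWO_PI for f in freqs]
k_est, x_est = pdm_unwrap(wrapped, freqs)
```

The number of tested combinations is the product of the frequencies, 60 in this small example, which illustrates why the plain search becomes expensive for high frequencies.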


4.3 Improving the Phase Unwrapping Algorithms 


The classical phase unwrapping algorithms from the previous sections 
do not use all of the information to unwrap the phase measurements. Far 
more importantly, they generally do not take into account the inherently 
periodic structure of the phase, which can lead to incorrect unwrapping. 
For example, the simple hierarchical unwrapping method from the
last section only uses the previous phase measurement with a lower 
frequency to unwrap the current phase. However, phase measurements 
with higher frequency could also contain information to unwrap the 
phase measurements with lower frequency. In addition, the periodic 
structure of the phase is not taken into account, so unwrapping errors 
often occur near the $2\pi$-discontinuities. To achieve good accuracy for
3D reconstruction, phase maps corresponding to high frequencies are 
needed. But then the number of necessary measurements is high be- 
cause the sequence always starts at f = 1. The heterodyne method does 
not have to start at low frequencies but can directly select high ones, 
achieving an overall smaller mean uncertainty with the same number 
of measurements [150]. However, it is disadvantageous that the unam- 
biguous measurement range of the unwrapping is determined by the 
beat frequency. Thus, there are frequency configurations that do not 
yield an unambiguous solution but could be solved unambiguously with 
other methods [149]. Additionally, the method is not straightforwardly 
extendable to a multi-frequency approach. The number-theoretical un- 
wrapping methods perform a simultaneous unwrapping of all phase 
measurements and also consider the periodicity of the phase. Neverthe- 
less, the restriction to pairwise co-prime wavelengths makes the selection 
more difficult, and due to the integer arithmetic and rounding
operations, these methods are relatively susceptible to noise [198]. Moreover,
for the method to work, the frequencies must be chosen very precisely 
proportional to the integer co-prime wavelengths, which is especially 
problematic for applications where the wavelengths cannot be chosen 
freely, e.g., interferometry [53]. The PDM method, on the other hand, per- 
forms a simultaneous unwrapping of all phase measurements without 
having to apply rounding operations. In its current form, however, it is 
still not perfect. It does not take into account the periodic structure of the 
phase so that unwrapping errors occur frequently near the boundaries of 
the coding interval. Also, it is a very expensive procedure due to testing
all possible combinations of unwrapping factors $k_i$. Additionally, all
methods have in common that the phase unwrapping does not consider
the estimated phase uncertainty at all, although it could help to
compensate for an unfavorable measurement. Therefore, the following sections
present improvements to the classical phase unwrapping algorithms to
deal with their shortcomings.


4.3.1 Weighted Circular Mean 


Multi-frequency phase unwrapping gives different estimates of the phase
$\Phi_i$ with different uncertainty
$\sigma_{\Phi, f_i} = \sigma_{\varphi_i} / f_i$. In many practical
applications, only the phase with the highest frequency is used, or in the
better case, a weighted average of all phase measurements is calculated.
Here either the frequency is chosen as the weighting factor or the inverse
of the estimated uncertainty (4.13) is used. Usually, in the sense of an
unbiased estimator, the inverse variance of the measurement is used as
weighting to obtain the phase

\[ \bar{\Phi} = \frac{\sum_i \sigma_{\Phi, f_i}^{-2} \Phi_i}{\sum_i \sigma_{\Phi, f_i}^{-2}}. \tag{4.25} \]

However, this very common approach ignores an important property
of the phase. The phase is periodic on the interval $[0, 2\pi)$. Since the
phase measurement is affected by noise, a true phase value of
$\varphi = 0$ can, for example, be estimated as the value
$\varphi_1 = 0.01$ in one measurement and as $\varphi_2 = 1.99\pi$ in a
second measurement. As a result, the mean value is not $\varphi = 0$ as
expected but $\varphi = \frac{1}{2}(\varphi_1 + \varphi_2) \approx \pi$.
Therefore, very large errors appear at the boundary of the coding interval.
A commonly used workaround for this problem is to artificially reduce the
used encoding interval. That means instead of displaying phase values in
the range $[0, 2\pi)$ on the screen, only the values
$[\Delta_\varphi, 2\pi - \Delta_\varphi)$ are used. Depending on the
expected noise, an optimal size can even be determined for
$\Delta_\varphi \in (0, \pi)$ [149, 150]. However, a reduction of the used
interval leads to a lower SNR in the remaining part, due to the effective
frequency of the represented sinusoidal signal being reduced.

With this in mind and since the phase is periodic in the interval
$[0, 2\pi)$, the usual arithmetic mean must not be used. Instead, this work
proposes to use a circular mean value $M^\circ$, which is formed by mapping
the phase measurements to the complex unit circle
$z_i = e^{\mathrm{j} \Phi_i}$ and by calculating a weighted mean of the
complex phasors

\[ \bar{z} = \frac{\sum_i \sigma_{\Phi, f_i}^{-2} z_i}{\sum_i \sigma_{\Phi, f_i}^{-2}}, \tag{4.26} \]

where the circular mean of the phase can then be calculated from the
argument of the resulting complex phasor

\[ M^\circ(\boldsymbol{\Phi}, \boldsymbol{\sigma}_{\Phi, f}) = \operatorname{arctan2}\left( \operatorname{Im}(\bar{z}), \operatorname{Re}(\bar{z}) \right) = \operatorname{arctan2}\left( \sum_i \frac{\sin(\Phi_i)}{\sigma_{\Phi, f_i}^{2}}, \; \sum_i \frac{\cos(\Phi_i)}{\sigma_{\Phi, f_i}^{2}} \right). \tag{4.27} \]

Additionally, the uncertainty of the mean phase

\[ \sigma_{\bar{\Phi}}^2 = \frac{1}{\sum_i \sigma_{\Phi, f_i}^{-2}} \tag{4.28} \]

can be estimated to be further used in any subsequent application.
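
The difference between the arithmetic mean and the weighted circular mean at the boundary of the coding interval can be reproduced with a few lines of Python. The sketch implements (4.27) and (4.28) with inverse-variance weights and maps the result to $[0, 2\pi)$ in the sense of (4.8); the function names and the uncertainty value 0.02 are assumptions made for the illustration.

```python
import math

TWO_PI = 2.0 * math.pi

def weighted_circular_mean(phases, sigmas):
    """Inverse-variance weighted circular mean and its uncertainty."""
    weights = [s ** -2 for s in sigmas]
    im = sum(w * math.sin(p) for w, p in zip(weights, phases))
    re = sum(w * math.cos(p) for w, p in zip(weights, phases))
    mean = math.atan2(im, re) % TWO_PI           # (4.27), mapped to [0, 2*pi)
    sigma_mean = math.sqrt(1.0 / sum(weights))   # (4.28)
    return mean, sigma_mean

def circular_distance(a, b):
    """Smallest distance between two angles in [0, 2*pi)."""
    return math.pi - abs(math.pi - abs(a - b))

# Two noisy measurements of a true phase of 0, one on each side of the wrap.
phases = [0.01, 1.99 * math.pi]
sigmas = [0.02, 0.02]
arithmetic = sum(phases) / len(phases)           # close to pi, i.e. wrong
circular, sigma_mean = weighted_circular_mean(phases, sigmas)
error_arithmetic = circular_distance(arithmetic, 0.0)
error_circular = circular_distance(circular, 0.0)
```

The arithmetic mean ends up near $\pi$, whereas the circular mean stays within about 0.011 rad of the true value without shrinking the encoding interval.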


4.3.2 Modified Hierarchical Unwrapping 


A disadvantage of the standard hierarchical phase unwrapping is that 
the phase measurements belonging to higher frequencies are unwrapped 
solely with the help of the previously unwrapped phase measurement. 
For the case of more than two used frequencies, it makes sense to modify 
the standard approach to make the unwrapping more robust against 
errors. It is advisable to use not only the last phase measurement as a 
reference, but the average of all phases already processed. The more fre- 
quencies are used, the more the method will benefit from all the previous 
unwrapped phases. For the averaging operator, the weighted circular 
mean from the previous section is used. With it, the periodicity of the 
phase can be partially compensated, and by using the phase uncertainty 
as a weighting factor, the overall unwrapping is improved due to pe- 
nalizing low-quality phase measurements. The standard hierarchical 
algorithm is relatively easy to adjust to obtain the modified hierarchical 
unwrapping. Algorithm 1 shows the procedure. 
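Algorithm 1 Modified Hierarchical Unwrapping

Input: Wrapped phase maps $\boldsymbol{\varphi}$, frequencies $\mathbf{f}$ (with $f_0 = 1$), phase uncertainties $\boldsymbol{\sigma}_\varphi$
Output: Fusion of unwrapped phase maps $\bar{\Phi}$
 1: Set $\Phi_{\mathrm{ref}} = \Phi_0 = \varphi_0 \bmod 2\pi$
 2: for $n = 0, 1, \dots, N-1$ do
 3:     Get unwrapping factor
 4:     $k_n = \operatorname{round}\left( \frac{f_n \Phi_{\mathrm{ref}} - \varphi_n}{2\pi} \right)$
 5:     Unwrap current phase map
 6:     $\Phi_n = \frac{\varphi_n + 2\pi k_n}{f_n} \bmod 2\pi$
 7:     Calculate new reference with circular mean of previous estimates
 8:     $\Phi_{\mathrm{ref}} = M^\circ(\boldsymbol{\Phi}_{0:n}, \boldsymbol{\sigma}_{\Phi, f_{0:n}})$
 9: end for
10: Calculate circular mean of all unwrapped phases
11: $\bar{\Phi} = M^\circ(\boldsymbol{\Phi}, \boldsymbol{\sigma}_{\Phi, f})$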
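
Algorithm 1 can be transcribed to Python for a single pixel as sketched below. The sketch assumes that the uncertainty of the normalized phase is obtained by dividing the wrapped phase uncertainty by the frequency, as in (4.13); the function names, the frequencies and the small deterministic phase errors are hypothetical.

```python
import math

TWO_PI = 2.0 * math.pi

def circular_mean(phases, sigmas):
    """Inverse-variance weighted circular mean, mapped to [0, 2*pi)."""
    weights = [s ** -2 for s in sigmas]
    im = sum(w * math.sin(p) for w, p in zip(weights, phases))
    re = sum(w * math.cos(p) for w, p in zip(weights, phases))
    return math.atan2(im, re) % TWO_PI

def modified_hierarchical_unwrap(wrapped, freqs, sigma_phi):
    """Single-pixel version of Algorithm 1; freqs[0] must be 1."""
    ref = wrapped[0] % TWO_PI
    unwrapped, sigmas = [], []
    for phi_n, f_n, s_n in zip(wrapped, freqs, sigma_phi):
        k_n = round((f_n * ref - phi_n) / TWO_PI)                  # step 4
        unwrapped.append(((phi_n + TWO_PI * k_n) / f_n) % TWO_PI)  # step 6
        sigmas.append(s_n / f_n)          # uncertainty of the normalized phase
        ref = circular_mean(unwrapped, sigmas)                     # step 8
    return circular_mean(unwrapped, sigmas)                        # step 11

x_true = 0.37
freqs = [1, 4, 16]
errors = [0.05, -0.04, 0.03]          # hypothetical phase errors in rad
wrapped = [(TWO_PI * f * x_true + e) % TWO_PI for f, e in zip(freqs, errors)]
Phi_fused = modified_hierarchical_unwrap(wrapped, freqs, [0.05, 0.05, 0.05])
fusion_error = Phi_fused - TWO_PI * x_true
```

Because the weights grow with the square of the frequency, the fused result is dominated by the high-frequency measurement, while the low-frequency phases mainly stabilize the choice of the unwrapping factors.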


4.3.3 Modified PDM Unwrapping 


The PDM phase unwrapping method from Sec. 4.2.1.4 attempts to un- 
wrap the phase by minimizing the distance between the vector of phase 
measurements and a projected version of the same. This can be inter- 
preted as minimizing the distance between each unwrapped phase to 
the averaged unwrapped phase. By rewriting (4.24) it follows 


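\[ \| \boldsymbol{\Phi} - \mathbf{P} \boldsymbol{\Phi} \|^2 = \left\| \boldsymbol{\Phi} - \mathbf{f} \, \frac{\mathbf{f}^T \boldsymbol{\Phi}}{\mathbf{f}^T \mathbf{f}} \right\|^2 = \sum_i \left( \Phi_i - f_i \Phi_{\mathrm{mean}} \right)^2, \quad \Phi_{\mathrm{mean}} = \frac{\mathbf{f}^T \boldsymbol{\Phi}}{\mathbf{f}^T \mathbf{f}}, \tag{4.29} \]

where $\Phi_i = \varphi_i + 2\pi k_i$ and where the frequencies are used as
weighting factor in $\Phi_{\mathrm{mean}}$. To improve the method only three
simple modifications are necessary. First, the weighting factor is replaced
by the squared inverse of the frequency-dependent phase uncertainty
$\sigma_{\Phi, f_i}$. Further, to respect the periodicity of the phase, the
presented circular mean is used. Finally, instead of using a classical
distance measure, a circular distance

\[ d(\Phi_a, \Phi_b) = \pi - \left| \pi - |\Phi_a - \Phi_b| \right| \tag{4.30} \]

is used, which returns the smallest distance in a periodic interval between
two points $\Phi_a, \Phi_b \in [0, 2\pi)$. Phase unwrapping using this
modified PDM method is then achieved by finding the optimal unwrapping
factors

\[ \hat{\mathbf{k}} = \arg\min_{\mathbf{k}} \sum_i d\left( \Phi_i, M^\circ(\boldsymbol{\Phi}, \boldsymbol{\sigma}_{\Phi, f}) \right)^2. \tag{4.31} \]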
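
A brute-force Python sketch of the modified PDM search (4.31) is shown below. It assumes that each candidate phase is scaled back to $[0, 2\pi)$ as in (4.12) before the circular distance (4.30) is evaluated, and that the normalized uncertainty is the wrapped phase uncertainty divided by the frequency; the function names, frequencies and phase errors are hypothetical. The example places the coordinate close to the upper boundary of the coding interval.

```python
import itertools
import math

TWO_PI = 2.0 * math.pi

def circular_distance(a, b):
    """Smallest distance between two angles in [0, 2*pi), see (4.30)."""
    return math.pi - abs(math.pi - abs(a - b))

def circular_mean(phases, sigmas):
    """Inverse-variance weighted circular mean, mapped to [0, 2*pi)."""
    weights = [s ** -2 for s in sigmas]
    im = sum(w * math.sin(p) for w, p in zip(weights, phases))
    re = sum(w * math.cos(p) for w, p in zip(weights, phases))
    return math.atan2(im, re) % TWO_PI

def modified_pdm_unwrap(wrapped, freqs, sigma_phi):
    """Exhaustive search for the unwrapping factors minimizing (4.31)."""
    sigmas = [s / f for s, f in zip(sigma_phi, freqs)]
    best_cost, best_k, best_mean = math.inf, None, None
    ranges = [range(math.ceil(f)) for f in freqs]
    for k in itertools.product(*ranges):
        Phi = [((p + TWO_PI * ki) / f) % TWO_PI
               for p, ki, f in zip(wrapped, k, freqs)]
        mean = circular_mean(Phi, sigmas)
        cost = sum(circular_distance(P, mean) ** 2 for P in Phi)
        if cost < best_cost:
            best_cost, best_k, best_mean = cost, k, mean
    return best_k, best_mean

x_true = 0.995                        # close to the end of the coding interval
freqs = [3, 4, 5]
errors = [0.02, -0.03, 0.04]          # hypothetical phase errors in rad
wrapped = [(TWO_PI * f * x_true + e) % TWO_PI for f, e in zip(freqs, errors)]
k_est, Phi_est = modified_pdm_unwrap(wrapped, freqs, [0.05, 0.05, 0.05])
boundary_error = circular_distance(Phi_est, TWO_PI * x_true)
```

The search still visits every combination of unwrapping factors, so the sketch shares the computational cost of the classical PDM method.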


4.4 Probabilistic Approach for Temporal Phase 
Unwrapping 


The modified hierarchical procedure presented in the previous section 
attempts to integrate the uncertainty measurement into the unwrapping 
as a first step by using it in the weighted average calculation, partially 
respecting the periodicity of the phase by using the proposed circular 
mean. However, a complete and simultaneous unwrapping of all phase 
measurements is not given here either. The proposed modified PDM 
unwrapping respects the periodicity of the phase. However, because 
all combinations of the unwrapping factors need to be evaluated, it is 
computationally extremely expensive. Moreover, it is by no means clear 
whether the minimization of the squared circular distance yields an 
optimal unwrapping result. Therefore, this work proposes a completely 
different idea that addresses the phase unwrapping problem through a 
probabilistic approach. 

In the field of phase unwrapping, probabilistic approaches have al- 
ready been used in the spatial domain. Carballo and Fieguth [32] and 
Koetter et al. [105] use a probabilistic approach to model the probability of 
a phase discontinuity in interferometric synthetic aperture radar (InSAR) 
images to use them as weight factors for a spatial phase unwrapping 
procedure. Droeschel et al. [48] use a similar approach for time-of-flight
imaging. Baselice et al. [14] use an extended Kalman filter that includes 
probabilistic data to perform phase unwrapping and phase noise reduc- 
tion of InSAR data. 

In contrast to these approaches, a probabilistic model for temporal
phase unwrapping is proposed here. To solve the phase unwrapping 
problem optimally, an attempt is made to find the coordinate that has the 
highest probability of having caused the corresponding phase measure- 
ments. To formulate the unwrapping as a probability problem, the phase 
measurement is modeled as an appropriate stochastic process. This is 
used to determine the probability density of the encoded coordinate, 
find the optimal decoding by a maximum-likelihood approach, and thus 
implicitly and simultaneously compensate for the wrapping of all phase 
measurements. 


4.4.1 Probability Density Function of Phase-Shift Coding 


As indicated in Sec. 4.1.1, the variance of the image noise can be propa- 
gated through the phase-shifting algorithm. Thus, every measurement 
provides not only an estimate of the phase $\varphi$ but also the
uncertainty $\sigma_\varphi$ of this estimation. The probability density
function of the true phase is therefore centered around the respective
measurement. The question now arises which probability distribution the
phase has. In principle, several distribution functions are possible. Since
the image noise has a normal distribution, the first assumption is that the
phase noise is also normally distributed. However, because the phase has a
periodic structure and is only defined on the interval $[0, 2\pi)$, the
probability density must be searched in the field of circular statistics [93].


4.4.1.1 Wrapped Normal Distribution 


The most intuitive approach to obtain a probability distribution of the
phase is to assume a normal distribution
$\theta \sim \mathcal{N}(\mu, \sigma^2)$ and to allow its values to spread
over the entire set of real numbers, $\theta \in \mathbb{R}$. By folding
the density function around the unit circle

$$\varphi = \theta \bmod 2\pi, \tag{4.32}$$

the range of values is then forced to the interval $[0, 2\pi)$. The density
function of the folded random variable is then the wrapped normal
distribution [93]

$$p_{\mathrm{WN}}(\varphi) = \sum_{k=-\infty}^{\infty} \mathcal{N}(\mu + 2\pi k, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \sum_{k=-\infty}^{\infty} e^{-\frac{(\varphi - \mu - 2\pi k)^2}{2\sigma^2}}, \tag{4.33}$$


with the parameters $\mu \in [0, 2\pi)$ and $\sigma^2$. The density function
is symmetric and centered around the expected value $\mu$, whereas the
width of the function is governed by the parameter $\sigma$. Since in
practice the infinite sum must be truncated at some point, the literature
provides more efficient representations of the distribution, e.g.,

$$p_{\mathrm{WN}}(\varphi) = \frac{1}{2\pi} \left( 1 + 2 \sum_{p=1}^{\infty} e^{-\frac{\sigma^2 p^2}{2}} \cos\left(p(\varphi - \mu)\right) \right), \tag{4.34}$$

where, depending on the choice of $\sigma^2$, the sum can be truncated
after only a few terms [106].
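
As a rough numerical illustration, the truncated sum of shifted normal
densities can be compared with a truncated cosine series of the wrapped
normal distribution. The function names, the truncation limits `k_max`
and `p_max`, and the explicit series weights $e^{-\sigma^2 p^2/2}$ are
assumptions of this sketch and are not taken from the cited literature.

```python
import math


def wrapped_normal_sum(phi, mu, sigma, k_max=10):
    """Truncated sum of shifted normal densities, cf. (4.33)."""
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return norm * sum(
        math.exp(-((phi - mu - 2.0 * math.pi * k) ** 2) / (2.0 * sigma**2))
        for k in range(-k_max, k_max + 1)
    )


def wrapped_normal_series(phi, mu, sigma, p_max=10):
    """Truncated cosine series with weights exp(-sigma^2 p^2 / 2), cf. (4.34)."""
    series = sum(
        math.exp(-0.5 * (p * sigma) ** 2) * math.cos(p * (phi - mu))
        for p in range(1, p_max + 1)
    )
    return (1.0 + 2.0 * series) / (2.0 * math.pi)
```

For a broad distribution such as $\sigma = 2$, three series terms already
agree with the long sum to well below $10^{-10}$, which illustrates why
the series form is attractive for large uncertainties.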


4.4.1.2 von Mises Distribution 


A major disadvantage of the wrapped normal distribution is that it is 
quite intractable due to the infinite sum. Furthermore, it is not assured 
that a real phase measurement results from a folding operation on a linear 
normal distribution around the unit circle. Hence, it is not mandatory to 
assume that (4.32) is the correct description of the phase-shift coding. 

If the problem is approached with minimal knowledge, an alternative
probability density function for the phase can be found. The available
knowledge is: the expected value of the distribution corresponds to a
phase measurement $\mu$, there is a measure of the second central moment
$\sigma^2$, and the phase should be defined on the periodic interval
$[0, 2\pi)$. The circular probability density function which maximizes
the entropy under the given conditions and thus represents the ideal
choice under these circumstances is the von Mises distribution [93]

$$p_{\mathrm{VM}}(\varphi) = \frac{e^{\kappa \cos(\varphi - \mu)}}{2\pi I_0(\kappa)}, \tag{4.35}$$

where $I_0(\kappa)$ is the modified Bessel function of the first kind and
order zero

$$I_0(\kappa) = \frac{1}{2\pi} \int_0^{2\pi} e^{\kappa \cos\theta} \, d\theta = \sum_{r=0}^{\infty} \left(\frac{\kappa}{2}\right)^{2r} \left(\frac{1}{r!}\right)^2. \tag{4.36}$$

The parameter $\mu$ represents the expected value and $\kappa$ is a
concentration measure that is analogous to the inverse of the variance in the
normal distribution. Because of its mathematical simplicity, the von Mises 
distribution is one of the most commonly used distributions in circular 
statistics. And due to its great importance, it is also often referred to as 
the circular normal distribution [93]. 
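
The following minimal sketch evaluates the von Mises density with the
Bessel function computed from its power series. The function names and
the fixed number of series terms are assumptions made for illustration;
for large $\kappa$ a scaled Bessel routine from a numerical library would
be the more robust choice.

```python
import math


def bessel_i0(kappa, terms=60):
    """Power series of I_0: sum over r of (kappa/2)^(2r) / (r!)^2."""
    total, term = 0.0, 1.0
    for r in range(terms):
        total += term
        term *= (kappa / 2.0) ** 2 / ((r + 1) ** 2)
    return total


def von_mises_pdf(phi, mu, kappa):
    """von Mises density on [0, 2*pi) with mean mu and concentration kappa."""
    return math.exp(kappa * math.cos(phi - mu)) / (2.0 * math.pi * bessel_i0(kappa))
```

For $\kappa = 0$ the density reduces to the uniform distribution
$1/(2\pi)$, and for growing $\kappa$ it concentrates around $\mu$.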


4.4.1.3 Phase Noise Model of Phase-Shift Coding 


A more precise way to describe the probability density of the phase is 
to analyze the phase-shift coding directly. Rathjen [167] examines the 
random phase error arising from the normally distributed image noise 
of the sinusoidal pattern sequence. The two arguments of the arctan2 
function from (4.7) are described using a bivariate normal distribution, 
where the parameters of the distribution are computed from the nor- 
mal distribution of the image noise of the underlying pattern sequence. 
Finally, the distribution of the phase is calculated from this bivariate 
normal distribution, which applies to any phase-shift coding method. 

Depending on the algorithm, different distributions are obtained, 
which do not necessarily have to be symmetrical and which may also 
depend on the absolute value of the phase. For the symmetric M-step 
methods used in this work, the arguments of the arctan2-function are 
uncorrelated and have the same variance, leading to a symmetric distri- 
bution function for the phase that is independent of the absolute phase 
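value [167]. The probability distribution function of the phase $\varphi$
for symmetric $M$-step algorithms is then given by

$$p_{\mathrm{PS}}(\varphi) = \frac{e^{-\mathrm{SNR}^2}}{2\pi} \left[ 1 + \sqrt{\pi}\, \mathrm{SNR} \cos(\varphi - \mu)\, e^{\mathrm{SNR}^2 \cos^2(\varphi - \mu)} \left( 1 + \operatorname{erf}\left( \mathrm{SNR} \cos(\varphi - \mu) \right) \right) \right], \tag{4.37}$$

where $\operatorname{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2} \, dt$
is the Gaussian error function, the signal-to-noise ratio
$\mathrm{SNR} = \frac{1}{\sqrt{2}} \sigma_\varphi^{-1}$ determines the
width of the distribution, and $\mu \in [0, 2\pi)$ represents the
expected value.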
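
A hedged sketch of how this phase noise density could be evaluated
numerically is given below. It assumes the form of (4.37) with
$\mathrm{SNR} = 1/(\sqrt{2}\,\sigma_\varphi)$; the function name is
hypothetical. The two exponentials are merged into
$e^{-\mathrm{SNR}^2 \sin^2(\varphi - \mu)}$ and $1 + \operatorname{erf}(c)$
is written as $\operatorname{erfc}(-c)$, which is algebraically
equivalent but avoids overflow and cancellation for small uncertainties.

```python
import math


def phase_shift_pdf(phi, mu, sigma_phi):
    """Phase noise density for symmetric M-step methods, cf. (4.37)."""
    snr = 1.0 / (math.sqrt(2.0) * sigma_phi)
    c = snr * math.cos(phi - mu)
    uniform_part = math.exp(-snr**2) / (2.0 * math.pi)
    # c^2 - snr^2 = -snr^2 * sin^2(phi - mu) <= 0, so exp() cannot overflow
    peaked_part = c * math.exp(c**2 - snr**2) * math.erfc(-c) / (2.0 * math.sqrt(math.pi))
    return uniform_part + peaked_part
```

Integrating the function numerically over one period gives a quick
plausibility check that the density is normalized.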


4.4.1.4 Comparison 


To identify which of the presented distributions is best suited for the
problem, a Monte Carlo simulation of the phase measurement is performed.
For this, the coordinate $x = 0$ is encoded using phase-shift coding and
then the phase $\varphi$ is measured. The simulation is performed $10^7$
times, each time adding Gaussian image noise corresponding to a phase
uncertainty of $\sigma_\varphi = 5$. The probability density of the phase
noise can then be approximated using histogram analysis.

Figure 4.2(a) shows the histogram of the measured phases and the 
different density functions whose parameters can be calculated from the 
phase-shift coding. As expected, the phase noise can best be described 
by the noise model of Rathjen [167]. However, the von Mises distribution 
also shows a reasonably good fit to the histogram, whereas the wrapped 
normal distribution is too low on the hills and too high in other areas of 
the histogram. 

Figure 4.2(b) shows the Jensen-Shannon distance (JSD) [49] between the
histogram and each of the distributions over different phase uncertainty
values as a similarity measure, i.e., a small JSD value corresponds to a
high similarity. It can be seen that the model of Rathjen has a high
similarity to the histogram for all uncertainty values. The von Mises
distribution is also very close to the histogram and hence represents
the phase-shift coding sufficiently well, although the similarity is not
constant for all noise values. Finally, compared to those two
distributions, the wrapped normal distribution has a greater distance to
the histogram. For small noise values $\sigma_\varphi$, all distributions
converge into one another [93], so that they are almost equivalent, and
for very large noise values everything converges to the uniform
distribution on the interval $[0, 2\pi)$.
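
A scaled-down version of such a Monte Carlo experiment can be sketched
as follows. The pattern model $I_m = A + B\cos(\varphi + 2\pi m/M)$, the
arctan2 estimator, the values of `A` and `B`, and all function names are
assumptions of this sketch, since the phase-shift equations of Sec. 4.1
are not repeated here. The image noise is scaled with
$\sigma_I^2 = \sigma_\varphi^2 B^2 M / 2$ so that the resulting phase
noise should be close to the requested $\sigma_\varphi$.

```python
import math
import random


def measure_phase(phi_true, sigma_phi, M=8, A=0.5, B=0.5, rng=random):
    """Simulate one M-step phase measurement with Gaussian image noise."""
    sigma_img = sigma_phi * B * math.sqrt(M / 2.0)
    num = den = 0.0
    for m in range(M):
        shift = 2.0 * math.pi * m / M
        intensity = A + B * math.cos(phi_true + shift) + rng.gauss(0.0, sigma_img)
        num -= intensity * math.sin(shift)
        den += intensity * math.cos(shift)
    return math.atan2(num, den) % (2.0 * math.pi)


def circular_error(phi, phi_true):
    """Signed phase error wrapped to [-pi, pi)."""
    return (phi - phi_true + math.pi) % (2.0 * math.pi) - math.pi


def phase_error_std(phi_true, sigma_phi, runs=20000, seed=0):
    """Empirical standard deviation of the wrapped phase error."""
    rng = random.Random(seed)
    errs = [
        circular_error(measure_phase(phi_true, sigma_phi, rng=rng), phi_true)
        for _ in range(runs)
    ]
    mean = sum(errs) / runs
    return math.sqrt(sum((e - mean) ** 2 for e in errs) / (runs - 1))
```

Collecting the wrapped errors in a histogram instead of reducing them to
a standard deviation yields the empirical density against which the
candidate distributions can be compared.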


4.4.2 Compound Probability Density Function 


Because the individual phase measurement is affected by phase noise,
the probability density of the true phase is centered around the
respective measured value. To consider all phase measurements
simultaneously in the unwrapping, depending on their respective
uncertainty, it is necessary to search for the phase that caused the
individual measurements with maximum probability. Since the phase has a
periodic structure, the corresponding probability density must be
modeled using circular statistics.

Figure 4.2 Comparison of phase noise models. (a) Histogram of phase
noise with $\mu = 0$ and $\sigma_\varphi = 5$ and the three analyzed
probability density functions. (b) Jensen-Shannon distance to the
histogram for different values of the phase uncertainty.

The von Mises distribution is mathematically easy to handle, it ap- 
proximates the true distribution of the phase noise quite well, and a 
maximum-likelihood estimation can be performed in a numerically sta- 
ble way, cf. Sec. 4.4.3. Therefore, it will be used as the basis for modeling 
the phase measurement in the following. Modeling using the other den- 
sities would work analogously. 

The density function of the true phase $\varphi \in [0, 2\pi)$ as a
function of the measurement is therefore given by

$$p(\varphi | \varphi_i, \kappa_i) = \frac{e^{\kappa_i \cos(\varphi - \varphi_i)}}{2\pi I_0(\kappa_i)}. \tag{4.38}$$


Here, the measured phase is represented by $\varphi_i$, and
$\kappa_i = 1/\sigma_{\varphi_i}^2$ models the knowledge about the
uncertainty of the phase measurement and thus describes the
concentration of the distribution. Depending on the frequency of the
pattern sequence, the distribution function of the encoded coordinate
$x$ can now be derived. With $\varphi_i(x) = 2\pi f_i x$ and with the
known frequency $f_i$, the multi-modal von Mises distribution on the
periodic interval $x \in [0, 1)$ is obtained:

$$p(x | \varphi_i, \kappa_i, f_i) = \frac{e^{\kappa_i \cos(2\pi f_i x - \varphi_i)}}{I_0(\kappa_i)}. \tag{4.39}$$

Figure 4.3 Rows 1-3: Multi-modal von Mises distributions for different
frequencies. Row 4: Compound probability density function and
corresponding log-density. (a) Frequencies $f = (2, 3, 6)$: the density
has a unique maximum, since $\gcd(f) = 1$. (b) Frequencies
$f = (2, 4, 6)$: the solution of the maximum-likelihood estimation has a
two-fold ambiguity, since $\gcd(f) = 2 > 1$.


Due to the multi-modal character of the distribution, the ambiguity 
of the phase measurement becomes illustratively visible in the density 
functions, see figure 4.3. 

Since the acquisition of the sinusoidal pattern sequence using
phase-shift coding is performed independently for each image and
identical acquisition conditions are assumed, each image has in
principle the same standard deviation $\sigma_I$ of the image noise.
Therefore, the strength of the phase noise $\sigma_\varphi$ remains the
same in each measurement. Nevertheless, the variable substitution
$\varphi_i(x) = 2\pi f_i x$ scales the width of the distribution locally
by $1/f_i$. This leads to a reduction of the uncertainty, which in turn
comes at the cost of an $f_i$-fold ambiguity.

While the image noise generally remains the same for all images, the
estimated phase uncertainty can vary significantly for different
situations. For example, if impulse noise appears in images, it is
detected by the phase-shift coding as a reduction in the modulation $B$,
which leads to an increase in the estimated uncertainty $\sigma_\varphi$
for the respective pixels. On the other hand, if the sinusoidal pattern
is blurred due to the imaging system, the local contrast of the pattern
sequence decreases. Again, the modulation $B$ is affected and the
uncertainty increases for the whole phase measurement. Of course, this
is strongly influenced by the pattern frequencies used. The uncertainty
estimate thus contains knowledge about the system and can therefore be
integrated efficiently into the probabilistic modeling of the phase
estimate.

Depending on the chosen frequency of the sinusoidal pattern sequence,
each phase measurement $\varphi_i$ corresponds to an individual
probability distribution $p(x | \varphi_i, \kappa_i, f_i)$. Since each
phase measurement $\varphi_i$ is measured independently and all have the
same underlying coordinate, the compound density of $x$ for given
frequencies $f = (f_1, \ldots, f_N)$, phase measurements
$\varphi = (\varphi_1, \ldots, \varphi_N)$, and estimated concentration
parameters $\kappa = (\kappa_1, \ldots, \kappa_N)$ can be directly
expressed:

$$p(x | \varphi, \kappa, f) = \prod_i p(x | \varphi_i, \kappa_i, f_i) = \frac{e^{\sum_i \kappa_i \cos(2\pi f_i x - \varphi_i)}}{\prod_i I_0(\kappa_i)}. \tag{4.40}$$


4.4.3 Maximum-Likelihood Phase Unwrapping 


Having described the probability density function of the multi-frequency 
phase-shift coding, this can now be used to find the most likely coordi- 
nate that caused the phase measurements. The optimal coordinate and 
thus the simultaneous unwrapping of all phase measurements can be 
found with a maximum-likelihood estimator. As a result, maximizing 
the density function yields the sought coordinate 


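$$\begin{aligned}
\hat{x}_{\mathrm{ML}} &= \arg\max_x \, p(x | \varphi, \kappa, f) \\
&= \arg\max_x \, \log p(x | \varphi, \kappa, f) \\
&= \arg\max_x \sum_i \left( \kappa_i \cos(2\pi f_i x - \varphi_i) - \log I_0(\kappa_i) \right) \\
&= \arg\max_x \sum_i \kappa_i \cos(2\pi f_i x - \varphi_i).
\end{aligned} \tag{4.41}$$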


The logarithm of the Bessel function can be ignored due to its inde- 
pendence of x, and the monotonicity of the logarithm helps to simplify 
the equations and removes the potentially numerically more unstable 
exponential function. 
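
The simplified log-likelihood is cheap to evaluate, which the following
brute-force sketch illustrates. The dense grid search is only a
stand-in for illustration and is not the optimization strategy of this
work; the function names and the grid size `n` are assumptions.

```python
import math


def log_likelihood(x, phases, kappas, freqs):
    """Sum of kappa_i * cos(2*pi*f_i*x - phi_i), i.e. (4.41) without constants."""
    return sum(
        k * math.cos(2.0 * math.pi * f * x - p)
        for p, k, f in zip(phases, kappas, freqs)
    )


def unwrap_grid(phases, kappas, freqs, n=10000):
    """Return the grid point in [0, 1) with the largest log-likelihood."""
    grid = (j / n for j in range(n))
    return max(grid, key=lambda x: log_likelihood(x, phases, kappas, freqs))
```

With noise-free phases, every cosine reaches its maximum at the true
coordinate, so the sum attains its upper bound $\sum_i \kappa_i$ there.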


4.4.3.1 Uniqueness


To be able to identify a unique maximum, constraints must be applied
to the selected frequencies. With other unwrapping methods from the
literature, uniqueness can be achieved if the frequencies are relatively
prime [149, 255]. However, while all the frequencies need to be pairwise
co-prime integers with $\gcd(f_i, f_j) = 1, \forall i \neq j$ for
classical number-theoretical approaches, the presented approach has a
less restrictive condition. Here, uniqueness is obtained with
$\gcd(f) = \gcd(f_1, f_2, \ldots) \leq 1$, where the frequencies do not
necessarily need to be integer-valued. For frequencies
$f_i \in \mathbb{Q}$, the extension of the gcd to rational numbers can
be used to check uniqueness. For frequencies that are irrational
numbers, the maximum of (4.41) is theoretically always unique if
$\exists f_i \neq f_j$ with $i \neq j$. In this case, however, the
unwrapping might be more susceptible to noise when the frequencies are
poorly chosen. Figures 4.3(a) and 4.3(b) illustrate the uniqueness
constraint. In figure 4.3(a) the frequencies are set to
$f = (2, 3, 6)$, thus $\gcd(f) = 1$. Even though the frequencies are not
pairwise co-prime, with $\gcd(2, 6) = 2$ and $\gcd(3, 6) = 3$, a unique
maximum of the compound probability density can still be found. In
figure 4.3(b) the frequencies are set to $f = (2, 4, 6)$, thus
$\gcd(f) = 2$. Here the maximum has a two-fold ambiguity. The compound
density is only unique in the range $x \in [0, 0.5)$ and repeats itself
in $x \in [0.5, 1)$. Thus, in this case, the phase cannot be recovered
uniquely.
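
The uniqueness condition can be checked mechanically for integer or
rational frequencies. The sketch below uses the common extension of the
gcd to rationals (gcd of the numerators divided by the lcm of the
denominators of the reduced fractions); the function names are
hypothetical. Frequencies should be passed as integers or `Fraction`
objects, since binary floats would be converted to their exact, usually
unintended, rational value.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def rational_gcd(freqs):
    """Largest rational g such that every frequency is an integer multiple of g."""
    fracs = [Fraction(f) for f in freqs]
    num = reduce(gcd, (f.numerator for f in fracs))
    den = reduce(lambda a, b: a * b // gcd(a, b), (f.denominator for f in fracs))
    return Fraction(num, den)


def is_unique(freqs):
    """The compound density repeats with period 1/gcd(f); unique on [0, 1) if gcd <= 1."""
    return rational_gcd(freqs) <= 1
```

For example, `rational_gcd((2, 3, 6))` equals 1 and `rational_gcd((2, 4, 6))`
equals 2, matching the two cases shown in figure 4.3.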


4.4.3.2 Finding the Maximum 


Although (4.41) seems simple, no analytical solution can be given for
the global maximum because of the many local extrema. Therefore, the
problem must be solved numerically. However, no global optimizer (e.g.,
simulated annealing, differential evolution) can be used because it
could get stuck in a local maximum. To ensure that the maximum of the
probability density is found every time and to avoid unwrapping errors,
the optimization problem is solved on subintervals. To define the
subintervals, (4.41) must be interpreted as a signal
$g(x) = \sum_i \kappa_i \cos(2\pi f_i x - \varphi_i)$. Since it is a
summation of sinusoidal signals, the maximum frequency of the signal
$g(x)$ is equivalent to the maximum used frequency
$f_{\max} = \max(f_i)$ in the phase-shift coding. From sampling theory,
it is known that a discrete signal can be reconstructed from its
sampling points only if the signal does not change significantly between
said points [31]. Consequently, the sampling frequency must be
respected. Given the maximum frequency $f_{\max}$ and using the sampling
theorem, a minimum required number of intervals
$I_{\min} = \lceil 2 f_{\max} \rceil$ is obtained in which the global
maximum must uniquely lie as a single extremum. A simple 1D line search
procedure (see [140]) is now used to find the local maximum in each of
those subintervals. A comparison of the local maxima of the intervals
finally yields the global maximum.
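
The subinterval strategy can be sketched as follows. A golden-section
search is used here as one possible 1D line search; it is an assumption
of this sketch and not necessarily the procedure referenced in [140].
The function names and the tolerance are likewise hypothetical.

```python
import math


def objective(x, phases, kappas, freqs):
    """Signal g(x) = sum_i kappa_i * cos(2*pi*f_i*x - phi_i)."""
    return sum(
        k * math.cos(2.0 * math.pi * f * x - p)
        for p, k, f in zip(phases, kappas, freqs)
    )


def golden_max(g, a, b, tol=1e-9):
    """Golden-section search for a maximum of g on [a, b]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if g(c) > g(d):
            b = d  # maximum lies in [a, d]
        else:
            a = c  # maximum lies in [c, b]
    return 0.5 * (a + b)


def ml_unwrap(phases, kappas, freqs):
    """Search ceil(2 * f_max) subintervals of [0, 1) and keep the best candidate."""
    g = lambda x: objective(x, phases, kappas, freqs)
    n = math.ceil(2.0 * max(freqs))
    candidates = [golden_max(g, j / n, (j + 1) / n) for j in range(n)]
    return max(candidates, key=g) % 1.0
```

Each subinterval contributes exactly one candidate, so the cost grows
linearly with $f_{\max}$ and is independent of any grid resolution.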

From a purely practical point of view, it would be sufficient to reduce
the interval number to $I_{\min} = \lceil f_{\max} \rceil$, since only the
local maxima are
required and not the minima. Empirical investigations showed, however,
that in rare cases nearly saddle point-like shapes appear in the signal. In 
these cases, two local maxima can lie very close to each other, and thus, 
with the reduced number of intervals, only one can be identified as a 
local maximum in the optimization. Nonetheless, the global maximum 
could always be found unambiguously in billions of simulations, since 
the signal changes very strongly in the vicinity of the global maximum, 
and thus, only a single solution exists in the interval under investigation. 

As a remark, the presented maximum-likelihood optimization can in
principle also be carried out with the other distributions from
Sec. 4.4.1. However, since the log-likelihood function of the
corresponding densities cannot be represented as a simple sum of cosine
functions, the spectrum of these log-likelihood functions also has
components at higher frequencies. Nevertheless, empirical investigations
showed that higher frequencies are attenuated so strongly that the
sampling theorem is almost fulfilled and hence a maximum could still be
found every time. However, this could only be observed when all
$I_{\min} = \lceil 2 f_{\max} \rceil$ subintervals were searched. Thus,
the other density functions need twice the computation time of the
von Mises distribution.

In summary, with the presented method, the wrapping of all phases 
is compensated simultaneously and all measurements are fused to an 
optimal solution so that finally the most likely value of the coordinate 
x can be found.


4.4.4 Spatio-Temporal Phase Unwrapping


Temporal phase unwrapping has the great advantage that each pixel can 
be individually unwrapped and an absolute phase is obtained. This is 
especially useful when only a little information about the surface to be 
examined is known and when 2D unwrapping methods would lead to 
erroneous results. In many tasks of optical metrology, where structured 
illumination is used, continuous surfaces are often examined. For ex- 
ample, deflectometry often works with lacquered body parts from the 
automotive industry, with lenses, or parabolic mirrors, which can be de- 
scribed for the most part as continuous surfaces with only a few regions 
deviating from this continuity due to sharp edges. Also in time-of-flight 
imaging and many areas of profilometry, i.e., fringe projection, piecewise 
continuous objects are often inspected [47, 209]. This piecewise conti- 
nuity has the consequence that neighboring camera pixels will observe 
similar phase values on the surface. It is therefore reasonable to use this 
additional information to help with the phase unwrapping to suppress 
phase errors. 

The assumption of local continuity should be integrated into the prob- 
abilistic framework from the previous section. This allows performing 
not only an unwrapping in the temporal dimension but a 3D phase un- 
wrapping while implicitly smoothing the probability density functions 
over the spatial dimensions. To do this, the probability density of each 
camera pixel is modeled as a superposition of the probability densities of 
the local neighborhood. The probability density for each individual pixel
$u$ was already derived in the previous section and can be considered as
a conditional density

$$p(x(u) | \varphi(u), \kappa(u), f) = \prod_i p(x(u) | \varphi_i(u), \kappa_i(u), f_i). \tag{4.42}$$


If neighboring pixels can no longer be considered independently of 
each other, then the probability density results in a weighted superposi- 
tion of individual densities for each pixel u 


$$p(x(u)) := \sum_{\tilde{u} \in U(u)} p(u | \tilde{u}) \, p(x(\tilde{u}) | \varphi(\tilde{u}), \kappa(\tilde{u}), f), \tag{4.43}$$

where $U(u)$ represents a set of relevant neighborhood pixels. Since more
distant pixels have less influence and the modeling should be approached
with minimal knowledge about the observed surface, the transition
probabilities are modeled using a 2D normal distribution

$$p(u | \tilde{u}) = \mathcal{N}(\tilde{u}, \sigma_N^2 \mathbf{I}) = \frac{1}{2\pi\sigma_N^2} e^{-\frac{\|u - \tilde{u}\|^2}{2\sigma_N^2}}. \tag{4.44}$$


The compound density, consisting of a spatial modeling by means of 
normal distributions and a temporal modeling by means of von Mises 
distributions, finally results in 


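$$p(x(u)) = \sum_{\tilde{u} \in U(u)} \frac{\exp\left( -\frac{\|u - \tilde{u}\|^2}{2\sigma_N^2} \right) \exp\left( \sum_i \kappa_i(\tilde{u}) \cos\left( 2\pi f_i x - \varphi_i(\tilde{u}) \right) \right)}{2\pi\sigma_N^2 \prod_i I_0(\kappa_i(\tilde{u}))}. \tag{4.45}$$

Although this probability density appears more complicated than
equation (4.40) from the previous section, it can be maximized using
the same methods for finding the optimal solution of the coordinate:
$\hat{x}_{\mathrm{ML}}(u) = \arg\max_x p(x(u))$.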
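
A minimal sketch of the spatio-temporal density for a single pixel is
given below, assuming a Gaussian neighborhood weighting with standard
deviation `sigma_n` and per-pixel von Mises terms normalized by
$I_0(\kappa_i)$. The data layout (a dictionary from pixel positions to
phase and concentration tuples), the grid search, and all function names
are assumptions made for illustration.

```python
import math
from functools import lru_cache


@lru_cache(maxsize=None)
def log_i0(kappa, terms=80):
    """log of the Bessel function I_0 via its power series (moderate kappa only)."""
    total, term = 0.0, 1.0
    for r in range(terms):
        total += term
        term *= (kappa / 2.0) ** 2 / ((r + 1) ** 2)
    return math.log(total)


def spatio_temporal_density(x, u, neighbors, freqs, sigma_n=1.0):
    """Weighted superposition of the neighbors' compound densities, cf. (4.45)."""
    density = 0.0
    for (row, col), (phases, kappas) in neighbors.items():
        dist2 = (row - u[0]) ** 2 + (col - u[1]) ** 2
        log_term = -dist2 / (2.0 * sigma_n**2)
        for p, k, f in zip(phases, kappas, freqs):
            log_term += k * math.cos(2.0 * math.pi * f * x - p) - log_i0(k)
        density += math.exp(log_term)
    return density / (2.0 * math.pi * sigma_n**2)


def unwrap_pixel(u, neighbors, freqs, sigma_n=1.0, n=1000):
    """Grid-search stand-in for the maximization of the density over x."""
    grid = (j / n for j in range(n))
    return max(
        grid, key=lambda x: spatio_temporal_density(x, u, neighbors, freqs, sigma_n)
    )
```

If the center pixel carries a corrupted measurement while its eight
neighbors agree on another coordinate, the combined neighbor weight
outvotes the center, which is the intended smoothing effect in
continuous regions.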

However, it must be considered that this approach only leads to mean- 
ingful results if the local continuity assumption is not violated. To ensure 
that the given model is only applied in continuous areas, discontinuities 
have to be detected. 


4.4.4.1 Detection of Discontinuities 


Depending on the application, discontinuities in a surface can lead to 
discontinuities in the phase map. In the case of profilometry, a step in 
the surface results in a step in the phase map, whereas a step in the 
surface gradient does not necessarily destroy the continuity of the phase 
map. However, in the case of a deflectometric measurement of specular 
surfaces, even a step in the surface gradient may result in a step in the 
phase map. Consequently, this means that it is not the intention to detect 
edges on the surface but discontinuities in the unwrapped phase. 

For this purpose, a simple edge detector operating directly on the
wrapped phase estimates is suitable. However, since the
$2\pi$-discontinuities contained in the phase maps do not represent a
property of the surface, they must not be falsely detected, cf.
figure 4.10. Thus, a $2\pi$-invariant detector is needed. Typically,
gradient-based operators are utilized to detect edges in images. For
this, the Laplace operator
$\Delta\varphi(u) = \partial_{u_1}^2 \varphi(u) + \partial_{u_2}^2 \varphi(u)$
is often used. However, for a $2\pi$-phase jump in the wrapped phase,
the operator will yield a multiple of $2\pi$ even when the correctly
unwrapped phase would have only a small continuous change. To ignore
this property, a $2\pi$-invariant Laplace operator is defined

$$\Delta_{2\pi}\varphi(u) := \Delta\varphi(u) \bmod 2\pi, \tag{4.46}$$


which is only sensitive to phase discontinuities in the unwrapped phase
caused by the surface, whereas discontinuities that are caused by the
ambiguity of a wrapped phase are ignored. To reduce the effect of noise
in edge detection, a Laplacian of Gaussian may be used. Equation (4.46)
can take values within the periodic interval $[0, 2\pi)$. However, since
the strength of an edge is defined as the distance to 0, it is necessary
to calculate the circular distance for an appropriate edge quality
measure. Hence, for every phase measurement $\varphi_i(u)$, an energy
measure

$$E_i(u) = d\left(0, \Delta_{2\pi}\varphi_i(u)\right) = \pi - \left| \pi - \left| \Delta_{2\pi}\varphi_i(u) \right| \right| \tag{4.47}$$


is calculated in which the maximum possible circular distance is equal
to $\pi$, which would correspond to a strong edge feature. Further, an
appropriate averaging over all phase maps improves the edge estimate

$$E(u) = \frac{\sum_i \sigma_{\varphi_i}^{-2}(u) \, E_i(u)}{\sum_i \sigma_{\varphi_i}^{-2}(u)}, \tag{4.48}$$

where the uncertainty of the phase estimate can be taken into account.
Hence, the application of the modified Laplacian operator ultimately
provides an energy measure for an edge, which is insensitive towards
$2\pi$-discontinuities. Finally, subsequent thresholding on this energy
measure results in a feature map containing edge areas and non-edge 
areas, see figure 4.10. In places where an edge has been detected, the tem- 
poral modeling according to Sec. 4.4.2 must be used, whereas everywhere 
else the modeling according to Sec. 4.4.4 may be used to improve the 
phase unwrapping by utilizing the spatial neighborhood information. 
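
The $2\pi$-invariant edge energy of a single wrapped phase map can be
sketched as follows. The 5-point finite-difference stencil is an
assumption of this sketch; because its coefficients are integers, adding
multiples of $2\pi$ to individual pixels changes the Laplacian only by a
multiple of $2\pi$, which the modulo operation removes. The function
name and the list-of-lists image layout are hypothetical.

```python
import math

TWO_PI = 2.0 * math.pi


def edge_energy(phase):
    """Circular distance of the wrapped Laplacian to 0 for interior pixels."""
    rows, cols = len(phase), len(phase[0])
    energy = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            lap = (
                phase[r - 1][c] + phase[r + 1][c]
                + phase[r][c - 1] + phase[r][c + 1]
                - 4.0 * phase[r][c]
            )
            lap_2pi = lap % TWO_PI  # 2*pi-invariant Laplacian in [0, 2*pi)
            energy[r][c] = math.pi - abs(math.pi - lap_2pi)
    return energy
```

On a wrapped linear phase ramp the energy is numerically zero despite
the $2\pi$ jumps, whereas a genuine phase step of 2 rad between two
columns produces an energy of about 2 in both adjacent columns.
Thresholding this map then separates edge from non-edge areas.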


4.5 Evaluation


In this section, the presented methods are evaluated, analyzed, and com- 
pared with the state of the art. Sinusoidal pattern sequences with different 
frequencies are simulated and the respective phase is estimated using 
phase-shift coding, where the number of steps is chosen to be M = 8. The 
following unwrapping methods are examined: the hierarchical method
of Huntley and Saldner [88], the proposed modified hierarchical method, 
the heterodyne method of Lai et al. [107], the number-theoretical method 
of Towers et al. [202], the PDM method of Zuo et al. [255], the proposed 
modified PDM method, the proposed probabilistic temporal method, 
and the proposed probabilistic spatio-temporal method. For the pro- 
posed probabilistic methods, the von Mises probability density is used, 
unless specified otherwise. For the spatio-temporal method, a spatial
neighborhood $U(u)$ of $3 \times 3$ pixels is used. To investigate the robustness
of the presented phase unwrapping algorithms, the influence of Gaussian 
image noise and impulse noise is examined. 


4.5.1 Qualitative Comparison 


The resolution of the reference pattern generator was set to (2003, 2003).
For the first simulation, three phase measurements with frequencies
$f \approx (1, 3, 5)$ were generated. Because pairwise co-prime
wavelengths must be used for the number-theoretical method, the
wavelengths are quantized as $\lambda = (2003, 668, 401)$. This
corresponds to the set of frequencies $f \approx (1, 2.999, 4.995)$.
Nevertheless, since none of the methods is restricted to integer
frequencies, this does not result in any major disadvantages. The phase
uncertainty was chosen to be $\sigma_\varphi = 0.25$ rad $= 14.3°$.
Using (4.10), Gaussian noise with variance
$\sigma_I^2 = \sigma_\varphi^2 B^2 M / 2$ was added to the sinusoidal
pattern sequence. It is important to note that the noise is not added
to the wrapped phase measurements, as is often done in the literature,
but to the camera images; otherwise, no realistic statements about
phase-shift coding can be made. The heterodyne method calculates a
phase difference to obtain a unique reference phase. Since $\varphi_1$
is already unique, it does not make sense to evaluate the heterodyne
method for this frequency configuration. The coordinate $x \in [0, 1)$
was sampled in 2003 steps and each value was simulated 2003 times.
Figure 4.4(a) shows the phase measurements and the coordinates estimated
with the different unwrapping methods. Here, stronger colors represent a
higher point density. The upper three plots show the noisy phase
measurements $\varphi_i$ over the true coordinate $x$. The lower plots
show the corresponding estimated coordinates $\hat{x}$ over $x$.

Figure 4.4 Top: Noisy phase measurements $\varphi_i$ with frequencies
$f \approx (1, 3, 5)$. Bottom: Estimated coordinate $\hat{x}$ for
different methods.
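
The wavelength quantization described above can be reproduced with a few
lines. The helper names are hypothetical; the resolution of 2003 pixels
and the wavelength tuples are the values stated in the text.

```python
from itertools import combinations
from math import gcd


def frequencies(wavelengths, resolution=2003):
    """Pattern frequencies (periods per coding interval) for integer wavelengths."""
    return [resolution / w for w in wavelengths]


def pairwise_coprime(values):
    """True if every pair of integers has a greatest common divisor of 1."""
    return all(gcd(a, b) == 1 for a, b in combinations(values, 2))
```

Rounding `frequencies((2003, 668, 401))` to three decimals gives
1.0, 2.999, and 4.995, and `pairwise_coprime` confirms that the
quantized wavelengths satisfy the requirement of the number-theoretical
method.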

The hierarchical unwrapping shows a line of correctly unwrapped 
estimates in the middle section. At the boundaries of the coding interval, 
large errors appear because the periodicity of the phase is not implicitly 
modeled for this method. For these reasons, the effective coding interval 
is often reduced in practical applications. This avoids unwrapping errors,
but the effectively used frequency decreases, which increases the overall
phase uncertainty. In addition to the boundary errors, the hierarchical 
method shows unwrapping errors that are represented by parallel lines 
to the middle line. In these cases, the phase was incorrectly unwrapped 
once or even twice. Since the hierarchical method always refers back to 
the unwrapped previous phase, unwrapping errors propagate from top 
to bottom and can no longer be compensated once they have occurred. Using
the modified hierarchical method, boundary errors can be significantly re- 
duced, due to respecting the circularity of the phase by using the circular 
mean. Similar to the standard hierarchical method, parallel lines of un- 
wrapping errors appear, since the second phase is unwrapped by only us- 
ing the first measurement. The number-theoretical unwrapping shows no 
errors at the boundary of the coding interval since the Chinese remainder 
theorem is based on modulo arithmetic. Also, only a few unwrapping er- 
rors occur in the middle of the coding interval. Two lines of faulty estima- 
tions appear near the boundary, where noisy estimates x > 1 and x < 0 
are folded back into the used interval [0, 1) . Overall, the method is more 
susceptible to noise, which results in a coordinate estimation with greater 
uncertainty. The PDM unwrapping is much better. Almost all pixels in 
the middle of the coding interval are unwrapped correctly. The errors at 
the boundary are caused by the lack of modeling of the periodicity of the 
phase. By using the modified PDM method, these wrongly unwrapped 
pixels can be corrected. The proposed probabilistic temporal method can 
also compensate for the boundary errors since the periodicity is well de- 
scribed using circular statistics and it performs satisfactorily in other areas 
as well. Only a few pixels are unwrapped incorrectly. Finally, the spatio- 
temporal method yields even better results. Here almost all values are 
unwrapped correctly and the uncertainty of the estimation is the smallest 
compared to the other methods, as can be seen by the overall thinner line. 
In other words, it makes a lot of sense to include spatial information. 

In a second simulation, the sinusoidal pattern sequence is superimposed
with impulse noise, where the probability of an impulse is set to 0.15.
An impulse in the image appears either as a black pixel or as a white
pixel, i.e., it acts like salt-and-pepper noise. Again, the noise must
be added to the sinusoidal pattern sequence and not to the wrapped phase
maps. Although 15% of the pixels show an impulse, the same number of
phase estimates may not necessarily be affected. Figure 4.4(b) shows the
phase measurements and the coordinates estimated with the different
unwrapping methods. An impulse in the pattern sequence causes the
respective phase to be distorted to a greater or lesser extent,
depending on how far the impulse is from the correct intensity value.

Again, the hierarchical method shows similar effects as before, which 
are, however, more prominent here. The modified hierarchical method 
is slightly better than the standard approach and also the errors at the 
boundaries are smaller. The number-theoretical method can still provide 
a good unwrapping performance. Due to the generally higher noise 
level, the accuracy decreases. For the PDM method, more erroneous 
estimations occur, comparable to the hierarchical method, whereas the 
modified PDM method reduces the boundary errors. As opposed to this, 
the two probabilistic methods can still achieve very good results. This can 
be explained by considering that a phase measurement with an impulse 
in the pattern sequence has a smaller modulation B. This also increases 
the corresponding estimate of the phase uncertainty. This estimate can 
be used directly in the proposed methods to compensate for poor phase 
measurements. Thus, better phase measurements have more influence 
on the optimization. For the spatio-temporal method, this means that a 
distortion of the phase has an effect only if a large number of the pixels 
in the respective $3 \times 3 \times 3$ cube is disturbed. Since the probability of this
is quite low, the method yields almost no errors. 

To evaluate the heterodyne method, phase measurements with frequencies
$f \approx (6, 9, 11)$ are generated. For the same reasons as before, the
wavelengths were quantized as $\lambda = (331, 223, 181)$. This
corresponds to frequencies $f \approx (6.051, 8.982, 11.066)$. Image
noise is superimposed on the sinusoidal pattern sequence, corresponding
to a phase uncertainty of $\sigma_\varphi = 0.15$ rad $= 8.6°$. Since the
hierarchical and modified hierarchical methods can uniquely unwrap the
phases only up to the first period, they are not considered in this
comparison. Figure 4.5(a) shows the phase measurements and the
coordinates estimated with the different methods.

(a) Gaussian noise with σ_φ = 0.15 rad. (b) Impulse noise in 5% of the pixels.

Figure 4.5 Top: Noisy phase measurements φ_i with frequencies f ≈ (6, 9, 11).
Bottom: Estimated coordinate ξ̂ for different methods.

It can be seen that even with smaller noise than before, the heterodyne
method delivers only mediocre results. A large part of the pixels
is unwrapped correctly, though many lines of incorrect values appear
parallel to the correct line. This is due to the fact that the phase noise
is summed up when calculating the phase difference. To get a unique
solution, first φ_12 = φ_1 − φ_2 with f_12 = f_2 − f_1 ≈ 2.93 and φ_23 with
f_23 ≈ 2.08 are calculated. A unique phase can then be calculated with
φ_123 = φ_23 − φ_12 with f_123 = f_12 − f_23 ≈ 0.85. This is then used to
unwrap the individual phase measurements. However, since the noise is
summed up in each step, the reference phase is of poor quality, resulting
in a poor overall unwrapping result. Surprisingly, the number-theoretical
method fails completely. The integer arithmetic of the method cannot
work even at a very small noise level. The PDM method and the modified
PDM method show almost the same very good result, with only a few
boundary errors and two small clusters of erroneous estimates. As before,
the presented probabilistic methods provide very good results, with
the spatio-temporal approach yielding almost exclusively correct estimates.
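
The accumulation of noise in the heterodyne chain can be made concrete
with a small calculation. The sketch assumes independent noise with the
same standard deviation σ_φ in all three phase measurements and a coding
length of 2003 pixels (inferred from the wavelengths used in this
section). It only illustrates the argument and is not the evaluation code
of this thesis.

```python
import math

# Frequencies from the quantized wavelengths (coding length 2003 is an assumption).
f1, f2, f3 = (2003 / wavelength for wavelength in (331, 223, 181))

# Beat frequencies of the two-stage heterodyne scheme.
f12 = f2 - f1
f23 = f3 - f2
f123 = f12 - f23

# The final reference phase combines the measurements with weights (1, 2, 1)
# up to sign, so for independent noise its variance is (1 + 4 + 1) * sigma^2.
sigma_phi = 0.15
sigma_12 = math.sqrt(2.0) * sigma_phi
sigma_123 = math.sqrt(6.0) * sigma_phi

print(f"f12 = {f12:.2f}, f23 = {f23:.2f}, f123 = {f123:.2f}")
print(f"sigma_12 = {sigma_12:.3f} rad, sigma_123 = {sigma_123:.3f} rad")
```

Under these assumptions, the reference phase is about 2.4 times noisier
than a single measurement, even before it is scaled up to unwrap the
higher frequencies.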

For a final evaluation, the sinusoidal pattern sequence was again
overlaid with impulse noise, with the probability of an impulse set to
p_I = 0.05. Figure 4.5(b) shows the phase measurements and the coordinates
estimated with the different methods. Although the noise is very
small, the heterodyne method again shows many unwrapping errors.
The number-theoretical method delivers poor values too. Here we see that
almost only pixels without distortion are unwrapped correctly, visible by
the somewhat stronger line in the center. The PDM method can unwrap
the phase very well, as before. Errors are still found at the boundary and
in parts in the center, whereas the modified version better compensates
for the boundary errors. The presented probabilistic methods show almost
perfect results, which can be explained by the incorporation of the
estimated phase uncertainty in the unwrapping process.


4.5.2 Robustness Against Noise 


The presented methods are now evaluated quantitatively. For this
purpose, the robustness of the methods against Gaussian noise and
impulse noise is investigated. In order to compare all methods,
sinusoidal pattern sequences with M = 8 phase shifts were simulated.
Subsequently, noise of varying strength was superimposed on the images,
the phase was estimated using phase-shift coding, and finally, the phases
were unwrapped using the presented methods.


4.5.2.1 Error Metrics 


In order to make quantitative statements about the methods, suitable
error metrics have to be defined beforehand. As a first error measure, the
estimation error

\[ e_\xi = \left| \hat{\xi} - \xi_{\mathrm{true}} \right| \tag{4.49} \]

defines the absolute distance of the estimated coordinate ξ̂ to the true
coordinate ξ_true. The second error metric evaluates the quality of phase
unwrapping and describes the success rate, representing whether a pixel
was correctly unwrapped:

\[ s_\xi = \frac{1}{N} \sum_{i=0}^{N-1} C_i \tag{4.50} \]

where C_i indicates whether the phase measurement associated with the
frequency f_i has been correctly unwrapped

\[ C_i = \begin{cases} 1, & \left| k_{i,\mathrm{true}} - k_i \right| = 0 \\ 0, & \text{otherwise} \end{cases} \tag{4.51} \]

\[ \phantom{C_i} = \begin{cases} 1, & \frac{1}{2 f_i} > \left| \xi_{\mathrm{true}} - \hat{\xi} \right| \\ 0, & \text{otherwise} \end{cases} \tag{4.52} \]

Because the proposed methods do not directly unwrap the individual
phase measurements but return a global solution, (4.52) is used with
C_i = C_{i_max} and i_max = arg max_i f_i. Hence, any estimate that is farther
away from the true solution than 1/(2 f_max) is classified as an
unwrapping error.
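
The two metrics can be sketched as follows. This is a minimal
illustration under the assumption that the coordinate is normalized to the
coding interval [0, 1), so that 1/(2 f_max) is half a period of the highest
frequency. The function names are hypothetical.

```python
import numpy as np


def estimation_error(xi_hat, xi_true):
    """Absolute distance between estimated and true coordinate, cf. (4.49)."""
    return np.abs(np.asarray(xi_hat, dtype=float) - np.asarray(xi_true, dtype=float))


def success_rate(xi_hat, xi_true, frequencies):
    """Fraction of pixels within half a period of the highest frequency, cf. (4.52)."""
    f_max = float(np.max(frequencies))
    correct = estimation_error(xi_hat, xi_true) < 1.0 / (2.0 * f_max)
    return float(np.mean(correct))


# Three pixels: two good estimates and one unwrapping error (illustrative values).
xi_true = np.array([0.10, 0.50, 0.90])
xi_hat = np.array([0.1005, 0.50, 0.60])
freqs = (6.051, 8.982, 11.066)

mean_error = float(np.mean(estimation_error(xi_hat, xi_true)))
rate = success_rate(xi_hat, xi_true, freqs)
```

Evaluating the success rate with the highest frequency is the strictest
choice, since it has the shortest period.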


4.5.2.2 Error Evaluation 


For a first analysis, the frequencies of the sinusoidal pattern sequence
were again chosen to be f ≈ (1, 2.999, 4.995) to create integer wavelengths
λ = (2003, 668, 401) to ensure that the number-theoretical method can
be used. The robustness towards Gaussian image noise was analyzed
by increasing the phase uncertainty incrementally from σ_φ = 0 to
σ_φ = 0.5 rad ≈ 28.6° in 100 steps. For the analysis of robustness to impulse
noise, the probability of an impulse was increased stepwise from p_I = 0
to p_I = 20% in 100 steps. Figure 4.6 shows the results of the analysis as a
plot of the mean estimation error e_ξ and mean success rate s_ξ.

Figure 4.6 Evaluation of phase error e_ξ and success rate s_ξ with f ≈ (1, 3, 5)
for different phase unwrapping methods: number-theoretical, hierarchical,
modified hierarchical, PDM, modified PDM, probabilistic (temporal),
probabilistic (spatio-temporal).

The error metrics confirm the qualitative results from the previous
section, for both Gaussian noise and impulse noise. When analyzing the
influence of Gaussian noise, large differences between the methods can be
observed. The number-theoretical method consistently yields the worst
results with the largest estimation error. The success rate is also
consistently the lowest, mainly caused by the erroneous unwrapping at the
boundaries of the coding interval. Interestingly, the hierarchical method
and the PDM method show almost identical behavior up to about σ_φ = 0.2.
Only for higher noise levels does the advantage of the PDM method become
apparent, resulting in a lower estimation error and a higher success rate.
The modified hierarchical method and the modified PDM method show the same
behavior as the probabilistic temporal method for lower noise levels. For
high noise levels, the modified hierarchical method becomes comparable
to the standard PDM method. The proposed probabilistic methods
provide the best results with the smallest estimation error and highest
success rate, where even for very large noise levels the spatio-temporal
method can still correctly unwrap more than 99.9 % of the pixels. The
same conclusions can be drawn for the analysis of impulse noise. Here it
even becomes possible to order the methods from worst to best directly
by looking at the plots: number-theoretical, hierarchical, PDM, modified
hierarchical, modified PDM, probabilistic temporal, probabilistic
spatio-temporal.

For the second analysis, the frequencies of the sinusoidal pattern
sequence were again chosen to be f ≈ (6.051, 8.982, 11.066) to create
integer wavelengths λ = (331, 223, 181) suitable for the number-theoretical
method. The noise was parameterized in the same way as before.
Figure 4.7 shows the results of the analysis as a plot of the mean phase
error e_ξ and mean success rate s_ξ. When analyzing the influence of
Gaussian noise, it can be seen that the number-theoretical method is
extremely susceptible to noise. It delivers correct values only for very
small noise levels. Starting from a noise of σ_φ ≈ 0.02, it has already
reached the maximum possible mean error. For small noise levels, the
heterodyne method still shows very good results and can keep up with the
other methods. Only for larger noise do significant deficiencies become
apparent. For the investigated frequency configuration, the standard and
the modified PDM method have an almost identical success rate, which is
only slightly worse than that of the probabilistic temporal method. Also,
the probabilistic method is slightly better for low noise levels, resulting
in a smaller estimation error. For large noise levels, all three yield
almost the same result. The spatio-temporal method, on the other hand,
still yields very good results for high noise levels, even when a
phase-shift configuration consisting of high frequencies is used, where in
general the success rate is more susceptible to noise.

Figure 4.7 Evaluation of phase error e_ξ and success rate s_ξ with f ≈ (6, 9, 11)
for different phase unwrapping methods: number-theoretical, heterodyne,
PDM, modified PDM, probabilistic (temporal), probabilistic
(spatio-temporal).

The analysis of the impulse noise emphasizes again the advantages of
the proposed methods. The number-theoretical method and the heterodyne
method are very susceptible to impulse noise. Even small amounts
of noise cause the success rate to drop steeply and the estimation error to
rise significantly. The PDM method and the modified PDM method show
similar behavior, with the modified method being slightly better. Again,
the probabilistic temporal method gives better results than the classical
approaches for all noise levels. Interestingly, the spatio-temporal method
shows exceptionally good results here. Even with p_I = 20% impulse
noise, the success rate is still greater than 99.99 %. This can be explained
by the fact that a coordinate estimation is only disturbed if a certain
number of phase measurements are influenced by an impulse. Since
the spatio-temporal method combines 27 probability densities for each
coordinate estimate, the probability that a large part of these densities is
disturbed is very small. To obtain a correct estimate, at least one pixel of
the 3 × 3 spatial neighborhood must be correct for at least two of the three
phase measurements, since the corresponding frequencies are pairwise
co-prime and effectively two phase measurements are sufficient to get
a unique result. The probability of an unwrapping error at p_I = 20%
impulse noise with a spatial neighborhood of S = 9 pixels, N = 3
frequencies, and pairwise co-prime frequencies is therefore approximately

\[
\begin{aligned}
1 - \sum_{n=2}^{N} \binom{N}{n} \left(1 - p_I^S\right)^{n} \left(p_I^S\right)^{N-n}
&= \left(p_I^S\right)^{N} + N \left(p_I^S\right)^{N-1} \left(1 - p_I^S\right) \\
&= 0.2^{27} + 3 \cdot 0.2^{18} \left(1 - 0.2^{9}\right) \approx 7.9 \cdot 10^{-13}.
\end{aligned}
\]

The probability may be even lower since not every impulse necessarily
causes an erroneous measurement.
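
This binomial estimate can be checked numerically. The sketch below
assumes that impulses hit pixels and pattern sequences independently and
that a frequency counts as usable as soon as one pixel of its neighborhood
is undisturbed. The function name is hypothetical. The complementary sum
over n < 2 is evaluated directly to avoid floating-point cancellation.

```python
from math import comb


def unwrapping_error_probability(p_impulse, n_pixels, n_freqs, n_required=2):
    """Probability that fewer than n_required frequencies have a clean pixel."""
    # Probability that every pixel in the neighborhood of one frequency is hit.
    p_all_hit = p_impulse ** n_pixels
    return sum(
        comb(n_freqs, n) * (1.0 - p_all_hit) ** n * p_all_hit ** (n_freqs - n)
        for n in range(n_required)
    )


p_error = unwrapping_error_probability(p_impulse=0.2, n_pixels=9, n_freqs=3)
print(f"{p_error:.3e}")
```

With p_I = 0.2, S = 9, and N = 3, the term 3 · 0.2^18 dominates the result,
which is of the order of 10^-12.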


4.5.3 Comparison of Different Phase Noise Models 


To confirm the choice of the von Mises distribution as a representative for 
the probability density of the phase, this section compares the different 
densities. 


4.5.3.1 Robustness against Model Errors 


The phase uncertainty is in principle not known but has to be estimated
by using the standard deviation σ_I of the underlying image noise.
However, since this is either set arbitrarily or has to be estimated from
the camera parameters, model errors may be introduced. To investigate the
robustness against these model errors, a phase measurement with image
noise σ_I = 0.3 is simulated. The different probability densities are
parameterized with the incorrect σ̂_I = δ_σ σ_I, where the relative deviation
δ_σ describes the model error. Figure 4.8 shows the influence of the model
error on the temporal and spatio-temporal phase unwrapping. For the
temporal approach, the log-likelihood of the von Mises function is used.
Here, the phase unwrapping is completely independent of the model
error. Because the image noise has only a multiplicative influence on
the estimated phase uncertainty, this factor can be extracted from the
objective function (4.41) and has no significant influence on the
maximization. However, when the other distributions or the spatio-temporal
approach is used, the situation is different. Here, the influence of model
errors as well as numerical instabilities become apparent. For δ_σ < 0.5,
the error for the von Mises density and the phase-shift model becomes
larger, and sometimes the optimization of the phase-shift model fails
so that no result can be obtained. For too small uncertainties, the terms
in the exponential functions of the density (4.37) become too large, and
thus numerically bad or even invalid values can occur. The effects are
even stronger for the spatio-temporal approach. The wrapped Gaussian
density, on the other hand, shows good results only starting from
δ_σ = 1, whereas the results deteriorate again starting from δ_σ ≈ 2. In
summary, the log-likelihood approach is completely robust to model
errors, whereas the von Mises density and the phase-shift model show
poor results only for very small values of δ_σ. To avoid numerical
instabilities, the assumed image noise σ_I should therefore have a lower
bound, since its exact value does not have a significant influence on the
result. Nevertheless, the relative difference of the phase uncertainty σ_φ
and the influence of the frequencies are of course still important.

Figure 4.8 Evaluation of model deviations for different probability density
functions: wrapped Gaussian, von Mises, von Mises (log-likelihood),
phase-shift model.
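
Why a common scale factor drops out of the log-likelihood can be seen in
a small sketch. The objective function (4.41) is not repeated in this
section, so the code assumes a von Mises log-likelihood of the form
Σ_i κ_i cos(φ_i − 2π f_i ξ) with concentration κ_i = 1/σ_i², a coordinate ξ
normalized to [0, 1), and a plain grid search as optimizer. All names are
hypothetical, and the thesis may use a different optimizer.

```python
import numpy as np


def unwrap_temporal(phases, freqs, sigmas, n_grid=20001):
    """Grid-search maximizer of sum_i kappa_i * cos(phi_i - 2*pi*f_i*xi)."""
    phases, freqs, sigmas = (np.asarray(v, dtype=float) for v in (phases, freqs, sigmas))
    kappa = 1.0 / sigmas**2  # assumed von Mises concentration per measurement
    xi = np.linspace(0.0, 1.0, n_grid, endpoint=False)
    residual = phases[:, None] - 2.0 * np.pi * freqs[:, None] * xi[None, :]
    objective = (kappa[:, None] * np.cos(residual)).sum(axis=0)
    return float(xi[np.argmax(objective)])


# Illustrative measurement: three frequencies, small fixed phase deviations.
freqs = 2003.0 / np.array([331.0, 223.0, 181.0])
xi_true = 0.3217
phases = np.mod(2.0 * np.pi * freqs * xi_true + np.array([0.05, -0.03, 0.02]), 2.0 * np.pi)
sigma_phi = np.array([0.15, 0.15, 0.15])

# Scaling all uncertainties by the same model error delta only rescales the
# objective, so the maximizing coordinate stays the same.
estimates = [unwrap_temporal(phases, freqs, delta * sigma_phi) for delta in (0.5, 1.0, 2.0)]
```

Only the ratios between the individual uncertainties affect the location of
the maximum, which matches the remark above that the relative differences
of the phase uncertainty remain important.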


4.5.3.2 Robustness against Noise 


With the optimal σ_I selected, the probabilistic temporal phase
unwrapping is now analyzed in more detail. Table 4.1 shows the estimation
error and the success rate for Gaussian image noise with σ_I = 0.3 and
impulse noise with p_I = 0.03 for different probability densities. For
reference, the PDM method is shown too. As expected, the phase-shift model
according to Rathjen [167] gives the best results for the Gaussian noise
and the von Mises distribution the second best. Compared to the PDM method,
the probabilistic methods differ only minimally. Interestingly, for impulse
noise, the wrapped normal distribution performs better than the model
of Rathjen. Again, the von Mises distribution provides the second-best
results, which are only insignificantly worse than those of the wrapped
normal distribution. Apart from its other advantages, the von Mises
distribution therefore also turns out to be a good compromise that is
robust against both Gaussian and impulse noise.

Table 4.1 Comparison of different phase noise models.

                                   f = (1, 3, 5)          f = (6, 9, 11)
Noise     Method                10·e_ξ   s_ξ in %      10·e_ξ   s_ξ in %

Gaussian  PDM                    0.772    99.050        1.199    95.457
Gaussian  Wrapped normal         0.453    99.490        1.032    95.658
Gaussian  von Mises              0.438    99.526        0.912    96.275
Gaussian  Phase-shift model      0.425    99.549        0.862    96.505
Impulse   PDM                    0.157    99.839        0.148    99.514
Impulse   Wrapped normal         0.086    99.928        0.058    99.812
Impulse   von Mises              0.086    99.928        0.059    99.811
Impulse   Phase-shift model      0.091    99.925        0.068    99.781


4.5.4 Phase Map Reconstruction 


This section shows how phase maps are reconstructed using the
presented unwrapping methods. For this purpose, two phase maps (512 × 512
pixels) are generated, see figure 4.9. Phase map 1 shows a continuous
surface with hills and valleys, whereas phase map 2 represents a
discontinuous surface that has sharp edges. The corresponding sinusoidal
pattern sequences are generated with wavelengths λ = (331, 223, 181),
corresponding to frequencies f ≈ (6.051, 8.982, 11.066). The pattern
images are superimposed with Gaussian noise corresponding to σ_φ = 0.15.
Figure 4.9 shows the generated phase-shift image data and the wrapped
phase maps that are calculated using phase-shift coding.

Figure 4.9 (a) & (b) show the true phase maps. (c) & (d) show the noisy
sinusoidal patterns for phase offset ψ = 0 for the frequencies
f ≈ (6, 9, 11)ᵀ, from left to right respectively. (e) & (f) show the
corresponding noisy phase maps with phase noise σ_φ = 0.15.

4.5.4.1 Edge Detection Example 


Because the presented spatio-temporal method may not be used across
discontinuities, edges in the phase map must be detected first. The
application of edge detection to the phase maps is shown in figure 4.10.
Figure 4.10(a) and figure 4.10(c) each show the application of the presented
2π-invariant edge detector to the wrapped phase maps. Figure 4.10(b) and
figure 4.10(d) show the output of an edge detector that uses a standard
Laplacian and a standard absolute distance to obtain the edge instead of
the proposed 2π-invariant operations. It can be seen that the standard
detector not only detects the edges in the phase map but also the phase
jumps caused by the wrapping of the phase values. The presented detector,
on the other hand, detects only the real edges in the phase map. Since
phase map 1 is continuous, no edge is detected. Only a few individual
pixels are detected as edges since the edge detector is of course also
influenced by the noise. The spiral shape of phase map 2 can be detected
very well and, in addition, only a few individual pixels are incorrectly
detected as an edge. An optimization of the thresholding parameter in the
edge detection could resolve those wrongly detected pixels. However, even a
wrongly detected edge does not necessarily cause a faulty phase unwrapping,
since edge pixels are then "just" unwrapped using the probabilistic
temporal method instead of the spatio-temporal method, which still performs
very well.

Figure 4.10 Edge detection in phase maps: (a) & (c) show the proposed edge
detection for phase maps 1 and 2, respectively. (b) & (d) show the result of
a standard edge detection.
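
The difference between a standard and a 2π-invariant operation can be
illustrated with a small sketch. It is not the edge detector presented in
this thesis. It only assumes that the 2π-invariant variant replaces the
ordinary difference by a difference wrapped to (−π, π], and it compares a
4-neighbor Laplacian built from both. All names are hypothetical.

```python
import numpy as np


def wrapped_difference(a, b):
    """Signed difference a - b mapped to the interval (-pi, pi]."""
    return np.angle(np.exp(1j * (a - b)))


def laplacian(phase, difference):
    """4-neighbor Laplacian on the interior pixels using the given difference."""
    center = phase[1:-1, 1:-1]
    neighbors = (phase[:-2, 1:-1], phase[2:, 1:-1], phase[1:-1, :-2], phase[1:-1, 2:])
    return sum(difference(n, center) for n in neighbors)


# Wrapped phase of a smooth ramp: it contains 2*pi jumps but no real edge.
cols = np.arange(64)
phase = np.mod(0.3 * cols, 2.0 * np.pi)[None, :].repeat(16, axis=0)

standard = laplacian(phase, lambda a, b: a - b)
invariant = laplacian(phase, wrapped_difference)
```

On this ramp, the standard Laplacian responds with magnitude 2π at every
wrap, whereas the wrapped variant stays at zero up to rounding, so a
threshold on the latter reacts only to true discontinuities.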


4.5.4.2 Phase Reconstruction


The results of the unwrapping of phase map 1 are shown in figure 4.11 
for the heterodyne unwrapping, the PDM unwrapping, the proposed 
temporal unwrapping, and the proposed spatio-temporal unwrapping, 
respectively. The top row shows the reconstruction as a 3D plot, with 
the linearly increasing phase ramp subtracted for better visibility. The 
middle row shows the reconstructed phase and the bottom row shows 
the respective error. 

Figure 4.11 Reconstruction of phase map 1 influenced by Gaussian noise with
σ_φ = 0.15 rad. The top row shows the phase reconstruction as 3D plot, where
the linear phase ramp is removed. The middle row shows the reconstructed
phase. The bottom row shows the phase error. (a) Heterodyne: e_ξ = 0.399,
s_ξ = 78.19 %. (b) PDM: e_ξ = 0.011, s_ξ = 99.89 %. (c) Probabilistic
temporal: e_ξ = 0.008, s_ξ = 99.99 %. (d) Probabilistic spatio-temporal:
e_ξ = 0.003, s_ξ = 100 %.

It can be seen that the heterodyne method performs poorly.
The total error is quite high and only 78.19 % of the pixels are correctly
unwrapped. The reconstructed phase map looks very noisy. The PDM
method, on the other hand, yields 99.89 % correct pixels and thus provides
a far smoother phase reconstruction. Single unwrapping errors occur for
phase values close to 0 and 2π, e.g., near the large hill and the deep
valley. In addition, some unwrapping errors occur in the center of the
phase map near lines where the wrapped phases show 2π-discontinuities.
These errors have no obvious explanation at first. However, as indicated
by Petković et al. [149], their PDM method performs worse for non-integer
frequencies, which could be the cause. The proposed
probabilistic temporal method can correctly unwrap 99.99 % of the pixels.
Similar to the PDM method, isolated errors occur for values near the
boundaries of the coding interval. The errors along the 2π-discontinuities
of the wrapped phases do not occur here, which shows that the proposed
method also works properly for rational frequencies. The proposed
probabilistic spatio-temporal method can correctly unwrap all pixels. At
the same time, the general accuracy is higher, as can be seen in the error
map by the overall darker green color. Thus, the local information used in
the maximum-likelihood estimation not only improves the success rate of
the unwrapping but also acts as a denoising filter and therefore leads to
lower uncertainty in the estimated coordinate.

Figure 4.12 Reconstruction of phase map 2 influenced by Gaussian noise with
σ_φ = 0.15 rad. The top row shows the phase reconstruction as 3D plot. The
middle row shows the reconstructed phase. The bottom row shows the phase
error. (a) Heterodyne: e_ξ = 0.418, s_ξ = 78.35 %. (b) PDM: e_ξ = 0.031,
s_ξ = 99.24 %. (c) Probabilistic temporal: e_ξ = 0.014, s_ξ = 99.90 %.
(d) Probabilistic spatio-temporal: e_ξ = 0.005, s_ξ = 99.97 %.

Figure 4.12 shows the results of the unwrapping of phase map 2, again
for the heterodyne unwrapping, the PDM unwrapping, and the presented
temporal and spatio-temporal unwrapping, respectively. Here again, the
heterodyne method performs significantly worse than the other methods.
Only 78.35 % of the pixels can be unwrapped correctly and the estimation
error is very high. As before, the PDM method shows errors at the
boundaries of the coding interval, which appear at the right edge of
the spiral and the right side of the phase map. In addition, unwrapping
errors occur near the 2π-discontinuities of the wrapped phases,
which could be caused by the non-integer frequencies. The probabilistic
temporal method again shows smaller errors at the boundaries of the
coding interval, and faulty lines as in the PDM method do not appear. The
spatio-temporal method again shows an overall smaller error and can
correctly unwrap almost all pixels. The error map shows that the pixels
along the edge of the spiral have a larger error. This can be explained
by the fact that for these pixels the continuity assumption of the surface
is violated and these pixels were detected as an edge, see figure 4.10(c).
Wherever an edge is detected, the temporal method is used; everywhere
else, the spatio-temporal method helps to improve the estimation. Further,
the spatio-temporal method is more robust against unwrapping errors,
which can also be seen in the error map at the right edge of the spiral.
Here, unwrapping errors occur only exactly on the edge. The pixels away
from the edge can be correctly unwrapped.


4.6 Summary 


This chapter aimed to find a way to measure the deflectometric imaging 
function, which is needed for deflectometry as well as for the camera cali- 
bration presented in this thesis. An optical encoding utilizing phase-shift 
coding was discussed, which allows finding a direct mapping of camera 
pixels to points in the plane of the monitor screen. For the decoding 
of the monitor coordinates different phase unwrapping methods were 
presented. In addition, approaches were discussed on how the classical 
phase unwrapping methods can be improved. The main contribution 
of this chapter is a new probabilistic approach for phase unwrapping 
that uses circular statistics to describe the phase-shift coding. The
presented method unwraps all phase measurements simultaneously by
finding the coordinate that most likely caused the observed
phase measurements. Using circular statistics, both the periodicity of
the phase is taken into account and the estimation of the phase uncer- 
tainty can be included in the unwrapping process, thus automatically 
compensating for individual erroneous phase measurements. This is 
achieved by expressing the individual phase measurements as appropri- 
ate stochastic variables, where different distributions were investigated 
to describe them. Using this, the probability density of the encoded 
coordinate could be determined, which allowed finding the optimal
decoding by a maximum-likelihood approach. Thus, it became possible
to implicitly and simultaneously compensate for the wrapping of all
phase measurements. Furthermore, it was demonstrated how to extend 
the presented probabilistic method to a spatio-temporal approach by 
integrating a local surface continuity assumption into the framework 
and modeling the local pixel neighborhood. This results in an implicit 
smoothing of the probability densities over the spatial dimensions. To 
ensure the assumptions are not violated, a modified edge detector is 
used to detect discontinuities in the surface and exclude them from the 
spatial modeling. 

Simulations compared the presented methods with state-of-the-art 
temporal phase unwrapping algorithms and investigated the effect of 
different noise types. The results showed that the proposed probabilistic 
methods are noticeably more robust against noise. This provides the 
ability to increase the acquisition speed of the optical encoding by using 
phase-shift coding with fewer shifts, where the noise level is generally 
higher. It was also shown that the proposed methods allow a relatively 
free choice in the range of frequencies of the sinusoidal pattern sequence 
so that even rational frequencies yield good results. At the same time, 
it was demonstrated that by modeling the periodicity using circular 
probability densities, the unwrapping errors at the boundary of the
coding interval can be significantly reduced. In addition, the inclusion of
the phase uncertainty automatically compensates for overly noisy
phase measurements, making the presented methods very robust against
impulse noise. Although the von Mises distribution does not ideally
describe the phase noise, it handles impulse distortions better than the
model of Rathjen [167] and thus proves to be a suitable compromise to 
compensate well for both Gaussian noise and impulse noise at the same 
time. Because the image noise is in general unknown, model errors may 
be introduced. Nevertheless, the von Mises distribution again proved 
to be robust towards such errors. Finally, the extension of the temporal 
approach to a spatio-temporal approach can considerably increase the 
robustness of the method even further, eventually leading to improved 
accuracy of the camera-to-monitor registration. This provides ideal start- 
ing conditions for subsequent camera calibration procedures and the 
deflectometric reconstruction of specular surfaces. 




5 System Calibration 


In order to carry out a deflectometric measurement for specular surface 
reconstruction, it is not sufficient to measure only the simple imaging 
function as a registration between camera and monitor. With the registra- 
tion, we may know a mapping of camera pixels to monitor coordinates, 
but the geometry of the scene cannot be reconstructed without knowing 
the exact geometry of the measurement setup as well. The setup must 
therefore be calibrated. To perform a triangulation measurement of the 
surface, an intrinsic and extrinsic calibration of both the camera and the 
reference monitor is necessary. The intrinsic calibration of the camera 
allows a calculation of the vision rays of the camera. Since light field 
cameras have a more complex optical structure than standard cameras 
and since deflectometry requires a highly accurate calibration, it is
difficult to describe the light field camera with sufficient accuracy using
only a low-dimensional camera model. Therefore, the calibration of the light
field camera in this thesis is done by adopting a generic camera model, in 
which the vision rays belonging to each pixel are estimated individually, 
thereby achieving a high precision calibration. As the main contribu- 
tion of this chapter, an approach is presented that performs the generic 
calibration via an alternating optimization of the ray parameters and 
the unknown poses of a reference monitor. In addition, the positional 
uncertainty of the reference coordinates, which is obtained using the 
phase-shift coding, is taken into account in the optimization. In this con- 
text, the explicit intrinsic calibration of the monitor allows calculating the 
3D coordinate of an observed monitor feature using the registration data. 
This improves the overall calibration result because possible deformations
of the display are taken into account and the refraction at the front
glass is compensated for. However, the coordinates are then still
specified in the local coordinate system of the monitor. Only an extrinsic 
calibration of the whole measurement setup finally allows obtaining 
transformation parameters that connect the monitor coordinates and the
camera coordinates. In other words, the entire system calibration aims at
providing a vision ray for each camera pixel and, additionally, it allows 
determining the 3D coordinates of a feature on the monitor in the camera 
coordinate system. Ultimately, this will then be used in Ch. 7 to calculate 
the surface normals of a specular object under examination. 

In the following section, the basics of camera calibration are
explained using the classic pinhole camera model as an example.
Subsequently, Sec. 5.2.1 introduces the generic camera model. Sec. 5.2 shows
how the model can be fitted to measurement data to estimate its
parameters. In Sec. 5.3, the reference monitor is described, and it is shown how
the monitor model and the estimation of its parameters can be integrated 
into the generic camera calibration. Finally, in Sec. 5.4, as the last part 
of the system calibration, the extrinsic calibration of the deflectometry 
measurement system is described. It returns the relative pose between 
camera and monitor. Sec. 5.5 concludes with an evaluation and analysis 
of the presented methods. 


5.1 Principles of Camera Calibration 


Figure 5.1 Pinhole camera model: A 3D point is perspectively projected onto
an image plane that is placed at distance f to the origin.

Probably the simplest and most widely used camera model is the pinhole
camera model, see figure 5.1. It describes the projection of points in 3D
space onto an image plane. The center of the projection is the origin of
the camera coordinate system, and it is often referred to as the optical
center. The image plane is located at a distance f from this center, and the
line from the camera center perpendicular to the image plane is called
the principal axis or optical axis. The point where this axis meets the
image plane is called the principal point. In the pinhole camera model,
a point in space with coordinates x = (x, y, z)ᵀ is mapped to a point
(f x/z, f y/z, f)ᵀ in the image plane [72]. Here, it is still assumed that
the origin of the image coordinates in the image plane lies in the principal
point, which is rarely the case for real cameras. Hence, a more general
mapping from points in 3D space to points in image space is

\[ (x, y, z)^\top \mapsto \left( f\,\frac{x}{z} + c_s,\; f\,\frac{y}{z} + c_t \right)^\top, \tag{5.1} \]

where (c_s, c_t)ᵀ are the local 2D coordinates of the principal point in
the image plane. If the world points are represented in homogeneous
coordinates, the central projection can be expressed simply as matrix
multiplication. More generally, if a 3D point is first transformed into
the camera coordinate system, the complete projection equation for the
pinhole camera model is obtained [72]:

\[ \lambda \begin{pmatrix} s \\ t \\ 1 \end{pmatrix} = \mathbf{K} \left( \mathbf{R} \mid \mathbf{t} \right) \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}, \qquad \mathbf{K} = \begin{pmatrix} f_s & 0 & c_s \\ 0 & f_t & c_t \\ 0 & 0 & 1 \end{pmatrix}, \tag{5.2} \]

where s, t are the local coordinates in the image plane, and λ = z is a
scaling factor. The matrix K represents the intrinsic parameters and is
called the calibration matrix of the camera (or camera matrix for short).
f_s and f_t represent the focal length of the camera on the s-axis and the
t-axis, respectively. The coordinate transformation of the 3D point is
described by the rotation matrix R and the translation vector t, which 
are the extrinsic parameters. 

The pinhole model does not account for lens distortion. To accurately 
represent a real camera, radial and tangential lens distortion are often 
introduced. Radial distortion describes an aberration whose magnitude 
depends solely on the distance of the object point from the optical axis [16], 
whereas tangential distortion is caused by a lens that is not parallel to the 
image plane [79]. A convenient camera model can be derived by combining 
the pinhole model with a distortion correction [246]. For example, for 
radial distortion, the observed image coordinates $(\tilde{s}, \tilde{t})$ can be calculated 
from the ideal coordinates $(s, t)$ with 


$$ \tilde{s} = s + \left( k_1 r^2 + k_2 r^4 + k_3 r^6 + \dots \right) (s - c_s), \tag{5.3} $$
$$ \tilde{t} = t + \left( k_1 r^2 + k_2 r^4 + k_3 r^6 + \dots \right) (t - c_t), \tag{5.4} $$


where $r^2 = (s - c_s)^2 + (t - c_t)^2$ and $k_1, k_2, \dots$ are the distortion coefficients. 
Estimating all the intrinsic camera parameters $\mathbf{K}, k_1, k_2, \dots$ is then done 
by observing features (usually checkerboard features) on a reference target 
from different positions, and by minimizing the projection error [246]. 
This is the distance in pixels between the observed 2D positions of the 
3D features on the image sensor and their projections to the sensor plane, 
which are calculated with the parametric camera model. 
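
As a hedged illustration of (5.3) and (5.4) and of the projection error, the following Python sketch applies a radial distortion polynomial and measures a pixel distance. The function names and the coefficient value are hypothetical and chosen only for demonstration; tangential distortion is not included.

```python
def distort_radial(s, t, cs, ct, k):
    """Map ideal image coordinates (s, t) to observed ones, following (5.3) and (5.4)."""
    r2 = (s - cs) ** 2 + (t - ct) ** 2
    # Polynomial k1*r^2 + k2*r^4 + k3*r^6 + ... for the given coefficient list k.
    factor = sum(k_j * r2 ** (j + 1) for j, k_j in enumerate(k))
    return s + factor * (s - cs), t + factor * (t - ct)


def projection_error(observed, projected):
    """Euclidean distance in pixels between an observed and a projected feature."""
    return ((observed[0] - projected[0]) ** 2 + (observed[1] - projected[1]) ** 2) ** 0.5


# Hypothetical values: principal point (320, 240) and a single coefficient k1.
s_obs, t_obs = distort_radial(420.0, 240.0, 320.0, 240.0, k=[1e-7])
err = projection_error((s_obs, t_obs), (420.0, 240.0))
print(s_obs, t_obs, err)
```

In a calibration, the sum of such squared errors over all features and target positions would be minimized with respect to the intrinsic and distortion parameters.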


5.2 Generic Camera Calibration 


Accurate optical measurement methods are becoming increasingly impor- 
tant for high-precision manufacturing. The rising demand can be satisfied 
by modern imaging systems with advanced optics. The exact geometric 
calibration of these systems is of essential importance for computer vi- 
sion and optical metrology. Most systems use perspective projection with 
a single projection center and are referred to as central cameras. They 
can often be described by low-dimensional, parametric models with few 
intrinsic parameters, e.g., the pinhole model from the previous section. 
In some applications in the field of optical metrology, more complex 
imaging systems are needed. These can often no longer be described 
by a central camera model and are in many cases non-parametric and 
non-central, e.g., multi-camera systems, catadioptric cameras, or light 
field cameras [72, 137, 156, 195]. Here, more sophisticated models are 
needed, which always have to be precisely adapted to the specific camera. 

Figure 5.2 An imaging system guides light rays to photosensitive elements. The generic 
camera model characterizes the light rays outside the camera independently of the internal 
optics. Only a relation between the rays $\mathbf{l}_i$ and the corresponding pixel index $i$ is established. 


5.2.1 The Generic Camera Model 


The disadvantage of low-dimensional models is that they have poor 
descriptive power, and in modern cameras not every pixel of the many 
millions can be perfectly described by these models. The more complex 
an imaging system is, the more difficult it becomes to model it. The more 
elaborate the optical elements are, the more challenging it becomes to 
find a mathematically correct mapping between the light of the captured 
scene and the physical sensor plane of the camera. Consequently, in recent 
years, the lack of flexibility and precision has led to the development of 
new camera models, where cameras can be described as generic imaging 
systems, which are independent of the specific camera type and allow 
high-precision calibration. 

The generic camera model was originally introduced in the works of 
Grossberg and Nayar [66, 67]. An arbitrary imaging system is modeled as 
a non-parametric discrete black box containing photosensitive elements. 
Each pixel collects light from a bundle of rays that enter the imaging 
system, referred to as a raxel, which consists of geometric ray coordinates 
and radiometric parameters. The set of all raxels builds the complete 
generic imaging model, see figure 5.2. 


5.2.2 Related Works 


The first work on the generic camera model was conducted by Grossberg 
and Nayar [66, 67]. The authors perform the calibration by measuring 
the intersection of camera rays with known reference targets: a monitor 
that is moved by a linear translation stage with known steps. To obtain 
the radiometric parameters, they control the intensity of light along the 
rays and measure the response in the image. Sturm and Ramalingam 
[191] and Ramalingam et al. [165] exclude the radiometric properties 
and propose a calibration of the generic model where the poses of the 
reference may be unknown. A closed-form solution can be obtained 
if the same pixel sees three points of the reference object. The downside 
of their method is that the ray distribution of the camera has to 
be known in advance. For example, different models apply when the 
imaging system is non-central or a perspective camera, and complicated 
parametrization steps are necessary. Bothe et al. [23] and Miraldo et al. 
[131] achieve pixel-wise calibration by circumventing the estimation of 
the target pose by simply tracking it using an external stereo camera 
system or an IR tracker, respectively. Bergamasco et al. [15], on the other 
hand, assume unknown poses and calibrate the camera by iteratively 
calculating the projection of the rays onto a coded calibration monitor, 
and by minimizing the resulting coding error on a pixel level. In addition, 
they estimate the reference pose using an adapted iterative closest point 
method. Miraldo and Araujo [130] reduce the number of parameters by 
fitting a spline surface onto the set of rays. Thus, they evaluate the cam- 
era on a subset of control points. Rosebrock [171] additionally includes 
the measurement uncertainty of the reference target into the calibration 
procedure by iteratively updating this spline surface. However, these 
spline-based methods only work when the imaging system is smooth, i.e., 
multi-camera systems, light field cameras, or other more complex optical 
systems are excluded and cannot be modeled using this approach. 

Apart from calibration, the generic camera model is used in the field of 
pose estimation, structure from motion, and surface reconstruction. Guo 
et al. [68] calibrate a generic camera system using a linear translation stage 
and then aim to estimate the pose of a target object by orthographically 
projecting it onto the calibration planes to approximate the true object 
pose in an iterative manner. Kneip and Furgale [104] propose UPnP that 
generalizes the absolute pose problem to general cameras by finding a 
closed-form least-squares solution for the absolute pose. Lee et al. [108] 
mount a multi-fish-eye camera system on a robotic car platform and use 
the generic camera model to track the position while driving. Albarelli 
et al. [5] use the generic imaging model in a structured light 3D scanning 
system, where they use a generic model for both the camera and the 
projector. 


5.2.3 Alternating Minimization-Based Calibration 


The goal of the following sections is to find a flexible calibration proce- 
dure that can accurately describe the geometric properties of an arbitrary 
imaging system using the generic camera model. In the end, however, 
one does not obtain an “image”, but rather a set of rays with correspond- 
ing intensities. Still, this does not interfere with many applications in 
optical metrology, e.g., laser triangulation, profilometry, or deflectometry, 
where mostly the geometric ray properties are relevant [11, 145, 209]. 
The presented method assumes unknown poses of the calibration target 
and iteratively solves the subproblems of camera calibration and pose 
estimation without the use of an additional translation or rotation stage. 
By processing every pixel individually and updating each pose one at a 
time, the computational costs can be reduced efficiently, whereby every 
camera ray and each observed reference point contribute to the result. 
The portion of the light that is sampled by a single pixel has a cone- 
shaped expansion due to the effects of the depth-of-field. For simplicity, 
this work models a raxel as a ray running through the center of this cone 
along the direction of light propagation. There are various possibilities 
for a mathematical description of rays, yet in this work, the concept of 
Plücker-coordinates as described in Sec. 2.3 is used. In 6D-Plücker-space, a 
Plücker-line $\mathbf{l} = (\mathbf{d}^\top, \mathbf{m}^\top)^\top \in \mathbb{P}^5$ is defined by its direction vector $\mathbf{d} \in \mathbb{R}^3$ 
and its moment vector $\mathbf{m} \in \mathbb{R}^3$ with the constraints $\|\mathbf{d}\| = 1$, $\mathbf{d}^\top \mathbf{m} = 0$. 
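
To make the ray representation concrete, the following Python sketch builds a Plücker line from two points and evaluates the ray-to-point distance $\|\mathbf{x} \times \mathbf{d} - \mathbf{m}\|$ that is used as the error measure later in this chapter. It is a minimal sketch; the helper names are hypothetical, and the moment convention $\mathbf{m} = \mathbf{p} \times \mathbf{d}$ is assumed so that points on the line have zero distance.

```python
import math


def cross(a, b):
    """Cross product a x b of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def plucker_from_points(p, q):
    """Plücker line (d, m) through the points p and q with |d| = 1 and d.m = 0."""
    diff = [q[i] - p[i] for i in range(3)]
    norm = math.sqrt(sum(c * c for c in diff))
    d = [c / norm for c in diff]
    m = cross(p, d)  # moment vector m = p x d for any point p on the line
    return d, m


def ray_point_distance(d, m, x):
    """Distance |x x d - m| between the point x and the normalized line (d, m)."""
    r = [c - m_c for c, m_c in zip(cross(x, d), m)]
    return math.sqrt(sum(c * c for c in r))


# Line parallel to the z-axis through (1, 0, 0); the test point is 5 units away.
d, m = plucker_from_points([1.0, 0.0, 0.0], [1.0, 0.0, 2.0])
dist = ray_point_distance(d, m, [4.0, 4.0, 7.0])
print(d, m, dist)
```

Because the moment is perpendicular to the direction by construction, the constraint $\mathbf{d}^\top \mathbf{m} = 0$ holds automatically for lines built this way.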

Calibrating the geometric properties of a camera using a generic camera 
model means that for each individual pixel its ray $\mathbf{l}$, with the direction 
vector $\mathbf{d}$ and the moment vector $\mathbf{m}$, must be estimated. This can be done 
in the usual way: estimating all unknown parameters by minimizing an 
error function. In the traditional camera calibration approach, one has to 
minimize the projection error, as a distance between the projection of an 
observed target feature onto the sensor plane and the observing pixel of 
the same feature. However, due to the independence of the rays from the 
actual physical camera system, when considering a generic model, such 
an error measure cannot be used, because there does not exist a model for 
the sensor plane. As an alternative, the ray re-projection error should be 
minimized instead, which represents the distance between the ray and the 
observed feature in 3D space. In conclusion, ray parameters are sought 
that minimize a suitable distance measure between the camera rays and 
observed reference points, whereby the positions of the references in the 
local coordinate system are assumed to be unknown. Figure 5.3 illustrates 
the approach. The calibration can now be formulated in the sense of a 
least-squares problem by minimizing 


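$$ F(\mathcal{R}, \mathcal{T}, \mathcal{L}) = \sum_{i,k} d(\mathbf{l}_i, \mathbf{p}_{ik})^2 . \tag{5.5} $$


Here, the index $i$ represents the individual rays and $k$ denotes the index 
of the reference target pose. The metric $d(\cdot, \cdot)$ is a suitable ray-to-point 
distance and $\mathbf{p}_{ik} = \mathbf{R}_k \mathbf{x}_{ik} + \mathbf{t}_k$ are the observed features in 3D space, 
where $\mathbf{x}_{ik}$ is a local point on a reference target. The matrix $\mathbf{R}_k \in SO(3)$ 
and the vector $\mathbf{t}_k \in \mathbb{R}^3$ are the corresponding transformations to the 
camera coordinate system. For a compact notation, the sets of rotations, 
translations, and rays are defined to be 


$$ \mathcal{R} = \{\mathbf{R}_1, \mathbf{R}_2, \mathbf{R}_3, \dots\}, \tag{5.6} $$
$$ \mathcal{T} = \{\mathbf{t}_1, \mathbf{t}_2, \mathbf{t}_3, \dots\}, \tag{5.7} $$
$$ \mathcal{L} = \{\mathbf{l}_1, \mathbf{l}_2, \mathbf{l}_3, \dots\}. \tag{5.8} $$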


Figure 5.3 Generic calibration: The imaging system is treated as a black box that is 
independent of the internal optics and is described by a set of vision rays $\mathbf{l}_i$. Each individual ray 
observes the intersected reference target point. The ideal calibration results in a minimal 
distance between rays and observed reference feature points. 


To present the camera calibration as a per-pixel problem and to treat 
each pixel independently of its neighbors, sufficient observations of 
reference features have to be available for every pixel. However, the widely 
used checkerboard patterns can provide only sparse features, which are 
not nearly enough for generic camera calibration. Instead, it is a good 
idea to use active targets, e.g., flat monitor displays, and active encoding 
strategies, to assign each camera ray a 2D point in the local reference tar- 
get plane. Thus, each ray can observe one feature per pose. In this work, 
the detection of features in the reference target plane, and with it the 
registration of camera rays $\mathbf{l}_i$ to monitor display points $\mathbf{x}_{ik}$, is achieved via a 
temporal coding of the monitor pixels. Hence, the presented multi-frequency 
phase-shift coding with the proposed probabilistic phase unwrapping 
from Ch. 4 can be used to obtain highly accurate reference features and 
their respective point uncertainties. Of course, to use the spatio-temporal 
phase unwrapping, the mapping of reference features onto camera pixels 
needs to be sufficiently smooth. That is, the reference target has to have 
a smooth surface, and in addition, the mapping of camera pixels onto 
the corresponding ray surface needs to be a continuous function. When 
standard cameras are used, this smoothness assumption can be easily 
met. For more complex camera systems, however, problems may arise. 
In particular, the MLA-based light field cameras that are investigated in 
this work show a strongly discontinuous behavior near the edges of the 
microlenses, hence violating the smoothness assumption. Nonetheless, 
these areas can easily be detected with the edge detection presented in 
Sec. 4.4.4, and can thus be excluded from the spatial modeling. 

With the previous results, an objective function can be defined that 
needs to be minimized to calibrate the camera and find all ray parameters 
$\mathbf{d}_i$, $\mathbf{m}_i$. Simultaneously, the poses of the calibration 
targets $\mathbf{R}_k$, $\mathbf{t}_k$ with respect to the camera are estimated. This is done 
in a weighted least-squares sense by minimizing the distance between 
uncertain target points $\mathbf{x}_{ik}$, each with an associated uncertainty $\sigma_{ik}$, and 
their corresponding camera rays. To this end, the phase-shift coding 
strategy is utilized to estimate the uncertainties of the reference target 
points, which results in a weighting factor $w_{ik}$ derived from $\sigma_{ik}$. In conclusion, the 
objective function for the generic camera calibration is obtained: 


$$ F(\mathcal{R}, \mathcal{T}, \mathcal{L}) = \sum_{i,k} w_{ik} \left\| (\mathbf{R}_k \mathbf{x}_{ik} + \mathbf{t}_k) \times \mathbf{d}_i - \mathbf{m}_i \right\|^2 . \tag{5.9} $$


Regardless of the used distance measure, it is very difficult to minimize 
such a problem in a reasonable time and with the appropriate use of 
computational resources. The ray model with six parameters and two 
constraints has four degrees of freedom per pixel. Especially for today’s 
standard cameras, this leads to a huge number of ray parameters that 
have to be optimized, e.g., a 40-megapixel camera has 240 million param- 
eters. In addition, the reference target pose is in general not known. This 
means that at the same time six degrees of freedom per pose have to 
be estimated. The coupling of poses and rays and the immense number 
of parameters result in an extremely high-dimensional problem that 
cannot be solved using a single optimization method. The calculation 
of a gradient or a Hessian and the corresponding function evaluations 
would be computationally too expensive. 

Figure 5.4 Generic ray estimation: Each individual ray observes several uncertain 
reference features. The optimal ray has a minimal distance to all observed points. 


Therefore, it is useful to divide the problem into subproblems and then 
solve them iteratively in the sense of an Alternating Minimization (AM) [70, 
139]. Accordingly, problem (5.9) is split into a camera calibration and 
a reference target pose estimation. The approach of an AM is to fix one 
parameter set and solve the resulting problem for the other. This way, one has two 
particular problems to solve in each iteration: 


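$$ \mathbf{l}_i^{(n)} = \operatorname*{arg\,min}_{\mathbf{l}_i \in \mathbb{P}^5} F(\mathcal{R}^{(n-1)}, \mathcal{T}^{(n-1)}, \mathcal{L}), \tag{5.10} $$
$$ \mathbf{R}_k^{(n)}, \mathbf{t}_k^{(n)} = \operatorname*{arg\,min}_{(\mathbf{R}_k, \mathbf{t}_k) \in SE(3)} F(\mathcal{R}, \mathcal{T}, \mathcal{L}^{(n)}), \tag{5.11} $$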


where an appropriate initialization $\mathcal{R}^{(0)}$, $\mathcal{T}^{(0)}$ has to be chosen. The first 
optimization problem is solved for each pixel i individually by fixing all 
the reference target poses and the second problem is solved for every 
single pose k by assuming fixed ray parameters. This allows for solving 
the subproblems more easily. It will be shown that optimal solutions can 
be found in each iteration, which further leads to the overall alternating 
minimization converging towards a solution. 


5.2.4 Generic Ray Estimation 


One step in the camera calibration procedure is to estimate the ray param- 
eters by assuming known poses of the calibration targets. This greatly 
reduces the complexity. Instead of calculating every parameter at once, 
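one can calibrate the ray $\mathbf{l}_i = (\mathbf{d}_i^\top, \mathbf{m}_i^\top)^\top \in \mathcal{L}$ of each pixel individually 
(or in parallel). Hence, for every single ray, a separate optimization 
problem is obtained, as illustrated in figure 5.4. To simplify the optimization, 
the objective function is written in the more compact form 


$$ \begin{aligned} f(\mathbf{d}_i, \mathbf{m}_i) &= \sum_k w_{ik} \left\| \mathbf{p}_{ik} \times \mathbf{d}_i - \mathbf{m}_i \right\|^2 \\ &= \sum_k w_{ik} \left( [\mathbf{p}_{ik}]_\times \mathbf{d}_i - \mathbf{m}_i \right)^\top \left( [\mathbf{p}_{ik}]_\times \mathbf{d}_i - \mathbf{m}_i \right) \\ &= \sum_k w_{ik} \left( \mathbf{d}_i^\top [\mathbf{p}_{ik}]_\times^\top [\mathbf{p}_{ik}]_\times \mathbf{d}_i + \mathbf{m}_i^\top\, 2 [\mathbf{p}_{ik}]_\times^\top \mathbf{d}_i + \|\mathbf{m}_i\|^2 \right) \\ &= \mathbf{d}_i^\top \mathbf{A}_{dd,i} \mathbf{d}_i + \mathbf{m}_i^\top \mathbf{A}_{md,i} \mathbf{d}_i + a_{mm,i} \|\mathbf{m}_i\|^2 , \end{aligned} \tag{5.12} $$


with $\mathbf{A}_{dd,i} = \sum_k w_{ik} [\mathbf{p}_{ik}]_\times^\top [\mathbf{p}_{ik}]_\times$, $\mathbf{A}_{md,i} = 2 \sum_k w_{ik} [\mathbf{p}_{ik}]_\times^\top$, and $a_{mm,i} = \sum_k w_{ik}$, 
where the vector $\mathbf{p}_{ik} = \mathbf{R}_k \mathbf{x}_{ik} + \mathbf{t}_k$ represents the reference target points 
in camera coordinates. In addition, for better readability, the index $i$ is 
omitted in the remainder of this section. 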

Since $\mathbf{A}_{dd}$ is derived from a sum of products of two mutually 
transposed matrices, it is always positive semidefinite. In addition, it is 
invertible as long as at least two different points $\mathbf{p}_{ik}$ are observed. Thus, 
problem (5.12) is convex. Considering the characteristics of the Plücker-rays 
(2.14), finding the optimal rays results in minimizing a quadratic 
program with quadratic equality constraints: $\mathbf{d}^\top \mathbf{m} = 0$, $\|\mathbf{d}\| = 1$. 
Although the minimization of such a problem, in general, requires a difficult 
nonlinear minimization, the following presents a solution to find a global 
minimum in this specific case, using a few simple steps. 

First, it should be noted that the solution of the constrained problem 
is scale ambiguous and that the norm of the ray direction $\|\mathbf{d}\|$ does not 
influence the actual ray properties [207]. Thus, after having found a 
solution, applying a normalization to the ray, $\mathbf{l}_n = \mathbf{l} / \|\mathbf{d}\| = (\mathbf{d}/\|\mathbf{d}\|, \mathbf{m}/\|\mathbf{d}\|)$, 
makes it possible to obtain a geometrically meaningful point-to-ray 
distance (2.20). To deal with the equality constraints, the 
method of Lagrange multipliers is used. Hence, the constraints are added 
to the objective function using the Lagrange multipliers $\lambda$, $\mu$: 


$$ g = \mathbf{d}^\top \mathbf{A}_{dd} \mathbf{d} + \mathbf{m}^\top \mathbf{A}_{md} \mathbf{d} + a_{mm} \|\mathbf{m}\|^2 + \lambda\, \mathbf{d}^\top \mathbf{m} + \mu \left( \mathbf{d}^\top \mathbf{d} - 1 \right). \tag{5.13} $$
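

Further, stationary points of this Lagrangian can be found by fulfilling 
the first-order conditions for a minimum: 


$$ \partial_{\mathbf{d}}\, g = 2 \mathbf{A}_{dd} \mathbf{d} + \mathbf{A}_{md}^\top \mathbf{m} + \lambda \mathbf{m} + 2 \mu \mathbf{d} = \mathbf{0}, \tag{5.14} $$
$$ \partial_{\mathbf{m}}\, g = 2 a_{mm} \mathbf{m} + \mathbf{A}_{md} \mathbf{d} + \lambda \mathbf{d} = \mathbf{0}, \tag{5.15} $$
$$ \partial_{\lambda}\, g = \mathbf{d}^\top \mathbf{m} = 0, \tag{5.16} $$
$$ \partial_{\mu}\, g = \|\mathbf{d}\|^2 - 1 = 0. \tag{5.17} $$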


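Using (5.15) and (5.16) results in the solution for the ray moment $\mathbf{m}$ and 
the multiplier $\lambda$: 


$$ \mathbf{m} = -\frac{1}{2 a_{mm}} \left( \mathbf{A}_{md} \mathbf{d} + \lambda \mathbf{d} \right), \tag{5.18} $$
$$ \mathbf{d}^\top \mathbf{m} = -\frac{1}{2 a_{mm}}\, \mathbf{d}^\top \left( \mathbf{A}_{md} + \lambda \mathbf{I} \right) \mathbf{d} = 0, \tag{5.19} $$
$$ \begin{aligned} \Rightarrow \lambda &= -\frac{\mathbf{d}^\top \mathbf{A}_{md} \mathbf{d}}{\mathbf{d}^\top \mathbf{d}} \overset{(5.17)}{=} -\mathbf{d}^\top \mathbf{A}_{md} \mathbf{d} = -\mathbf{d}^\top \Big( \sum_k 2 w_{ik} [\mathbf{p}_{ik}]_\times^\top \Big) \mathbf{d} \\ &= \mathbf{d}^\top \Big( \Big( \sum_k 2 w_{ik} \mathbf{p}_{ik} \Big) \times \mathbf{d} \Big) = \mathbf{d}^\top \left( \bar{\mathbf{p}} \times \mathbf{d} \right) = 0, \end{aligned} \tag{5.20} $$


where the last equation holds because $\mathbf{d}$ is orthogonal to $\mathbf{p} \times \mathbf{d}$, $\forall\, \mathbf{p} \in \mathbb{R}^3$. 
Inserting these results into (5.14) leads to a simple eigenvalue problem 
for the solution of the ray direction $\mathbf{d}$ and the Lagrange multiplier $\mu$: 


$$ \left( \mathbf{A}_{dd} - \frac{1}{4 a_{mm}} \mathbf{A}_{md}^\top \mathbf{A}_{md} \right) \mathbf{d} = -\mu\, \mathbf{d} . \tag{5.21} $$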

This equation still contains the trivial solution $\mathbf{d} = \mathbf{m} = \mathbf{0}$, which, 
however, has no geometric meaning for the calibration and is excluded 
by (5.17). Apart from that, the solution space of (5.21) consists of three 
eigenvalues $\mu_j$ with corresponding eigenvectors $\mathbf{d}_j$. After estimating a 
possible $\mathbf{d}_j$ and the corresponding Lagrange multiplier $\mu_j$, it is necessary 
to scale the eigenvector in order to normalize the ray such that 
$\|\mathbf{d}_j\| = 1$. This preserves the geometric meaning of (2.20) and yields 
an unambiguous scaling. Further, (5.18) provides the corresponding 
ray moment $\mathbf{m}_j$. Finally, from these at most three 
possible stationary points, the one with the smallest objective function 
value (5.13) is selected to be the optimal solution. In conclusion, one finds 
a closed-form solution for the least-squares problem of the weighted 
ray-to-point distance minimization. 


Figure 5.5 Generic pose estimation: The set of all rays observes features on the calibration 
reference. The optimal pose estimation results in a minimal distance between the rays and 
corresponding feature points. 
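
The closed-form ray estimation of this section can be sketched with NumPy as follows. This is a hedged illustration rather than the implementation used in this work: the function names are hypothetical, and the matrices are assumed to be $\mathbf{A}_{dd} = \sum_k w_k [\mathbf{p}_k]_\times^\top [\mathbf{p}_k]_\times$, $\mathbf{A}_{md} = 2 \sum_k w_k [\mathbf{p}_k]_\times^\top$, and $a_{mm} = \sum_k w_k$, as they follow from expanding the weighted squared ray-to-point distance.

```python
import numpy as np


def skew(p):
    """Cross-product matrix [p]_x with skew(p) @ d == np.cross(p, d)."""
    return np.array([[0.0, -p[2], p[1]],
                     [p[2], 0.0, -p[0]],
                     [-p[1], p[0], 0.0]])


def estimate_ray(points, weights):
    """Closed-form weighted ray fit; points has shape (K, 3), weights shape (K,)."""
    A_dd = sum(w * skew(p).T @ skew(p) for p, w in zip(points, weights))
    A_md = sum(2.0 * w * skew(p).T for p, w in zip(points, weights))
    a_mm = float(np.sum(weights))
    # Eigenvalue problem (5.21); eigh returns unit-norm eigenvectors, so |d| = 1.
    M = A_dd - A_md.T @ A_md / (4.0 * a_mm)
    _, eigvecs = np.linalg.eigh(M)
    best = None
    for j in range(3):
        d = eigvecs[:, j]
        m = -A_md @ d / (2.0 * a_mm)  # ray moment from (5.18) with lambda = 0
        cost = d @ A_dd @ d + m @ A_md @ d + a_mm * (m @ m)
        if best is None or cost < best[0]:
            best = (cost, d, m)
    return best[1], best[2], best[0]


# Synthetic check: three weighted points that lie exactly on one line.
direction = np.array([0.0, 0.6, 0.8])
anchor = np.array([1.0, 2.0, 3.0])
points = np.array([anchor + s * direction for s in (1.0, 2.0, 4.0)])
d_hat, m_hat, cost = estimate_ray(points, np.array([1.0, 2.0, 1.0]))
print(d_hat, m_hat, cost)
```

The sign of the returned direction is arbitrary, since an eigenvector and its negative describe the same line. For noise-free points on a line, the remaining cost is zero up to rounding.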


5.2.5 Generic Pose Estimation 


As before, the estimation of the calibration target pose can drastically be 
simplified by assuming known ray parameters. Therefore, it becomes 
possible to optimize each pose individually, as illustrated in figure 5.5. 
The objective function for each pose k becomes: 


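$$ f(\mathbf{R}_k, \mathbf{t}_k) = \sum_i w_{ik} \left\| (\mathbf{R}_k \mathbf{x}_{ik} + \mathbf{t}_k) \times \mathbf{d}_i - \mathbf{m}_i \right\|^2 . \tag{5.22} $$


However, solving for a pose $\mathbf{R}_k$, $\mathbf{t}_k$ is non-trivial because the solution 
space is restricted to the special Euclidean group SE(3), which combines 
rotations and translations in three dimensions, $\mathbf{R}_k \in SO(3)$ and $\mathbf{t}_k \in \mathbb{R}^3$, 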
respectively. Directly applying a nonlinear optimization procedure is not 
advisable, because every function evaluation results in the summation 
over all rays and is thus computationally very expensive. Therefore, as 
before, a more compact form of this quadratic function is necessary to 
reduce the computational effort. 

Again for the sake of brevity, the index $k$ is omitted for the remainder 
of this section. For further simplification, the vectorization operator 
$\mathbf{r} = \operatorname{vec}(\mathbf{R}) \in \mathbb{R}^9$ stacks the columns of the $3 \times 3$ matrix $\mathbf{R}$. By computing 
the summation over all ray indices $i$ only once, reordering, and extracting 
the pose parameters, the objective function can be formulated independently 
of the actual number of rays, which simplifies and speeds up the 
following optimization steps (see Appx. 9.1.1 for details): 


$$ f(\mathbf{R}, \mathbf{t}) = \mathbf{r}^\top \mathbf{A}_{rr} \mathbf{r} + \mathbf{t}^\top \mathbf{A}_{tt} \mathbf{t} + \mathbf{t}^\top \mathbf{A}_{tr} \mathbf{r} + \mathbf{b}_r^\top \mathbf{r} + \mathbf{b}_t^\top \mathbf{t} + h \tag{5.23} $$
subject to $\mathbf{r} = \operatorname{vec}(\mathbf{R})$, $(\mathbf{R}, \mathbf{t}) \in SE(3)$. 


Looking at the constrained quadratic objective (5.23), one may notice 
that the main constraint lies in the rotational part and that the objective is 
convex in the translational part. Thus, the problem can further be 
reduced by decoupling translation and rotation, which means that $\mathbf{t}$ can 
be expressed in terms of $\mathbf{R}$. The first-order condition for a minimum, 
$\partial_{\mathbf{t}} f(\mathbf{R}, \mathbf{t}) = \mathbf{0}$, leads to the optimal translation vector 


$$ \mathbf{t} = -\tfrac{1}{2} \mathbf{A}_{tt}^{-1} \left( \mathbf{A}_{tr} \mathbf{r} + \mathbf{b}_t \right). \tag{5.24} $$


Inserting (5.24) into (5.23) results in the decoupling of the rotation and 
translation subproblem, which then again yields a new quadratic 
optimization problem (see Appx. 9.1.1): 


$$ f(\mathbf{R}) = \mathbf{r}^\top \mathbf{A} \mathbf{r} + \mathbf{b}^\top \mathbf{r} + c, \quad \text{subject to } \mathbf{r} = \operatorname{vec}(\mathbf{R}),\ \mathbf{R} \in SO(3). \tag{5.25} $$


After finding a solution for the rotation matrix, the optimal translation 
vector is derived from (5.24), assuming invertibility of $\mathbf{A}_{tt}$. As shown 
in Appx. 9.1.3, the matrix $\mathbf{A}_{tt}$ is positive definite in most cases, except 
for a few exotic camera ray distributions, e.g., parallel rays as in telecentric 
optics. Hence, the equation truly finds the minimum of the objective with 
respect to the translation. 
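
The translation step can be illustrated with a short NumPy sketch. Since the matrices of (5.23) are defined in the appendix, this example does not use them; instead, it solves the same first-order condition $\partial_{\mathbf{t}} f = \mathbf{0}$ directly from the pose objective (5.22) for a fixed rotation. The function name and the synthetic data are hypothetical.

```python
import numpy as np


def skew(v):
    """Cross-product matrix [v]_x with skew(v) @ a == np.cross(v, a)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])


def optimal_translation(R, x, w, d, m):
    """Minimize sum_i w_i |(R x_i + t) x d_i - m_i|^2 over t for a fixed rotation R."""
    lhs = np.zeros((3, 3))
    rhs = np.zeros(3)
    for x_i, w_i, d_i, m_i in zip(x, w, d, m):
        D = skew(d_i)
        # Residual: (R x_i + t) x d_i - m_i = -(D @ (R @ x_i + t) + m_i).
        lhs += w_i * D.T @ D
        rhs -= w_i * D.T @ (D @ (R @ x_i) + m_i)
    # lhs becomes singular if all ray directions are parallel (telecentric case).
    return np.linalg.solve(lhs, rhs)


# Synthetic check: rays constructed to pass exactly through the posed target points.
R_true = np.eye(3)
t_true = np.array([0.1, -0.2, 2.0])
x = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
d = np.array([[0.0, 0.0, 1.0], [0.0, 0.6, 0.8], [0.6, 0.0, 0.8]])
m = np.array([np.cross(R_true @ x_i + t_true, d_i) for x_i, d_i in zip(x, d)])
t_hat = optimal_translation(R_true, x, np.ones(3), d, m)
print(t_hat)
```

The accumulated matrix `lhs` plays the role of the translational quadratic term here, which is consistent with the remark that parallel rays make the translation ambiguous.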

Although minimization of (5.25) seems simple at first, the optimization 
is constrained to find a solution in SO(3). This is equivalent 
to a non-convex problem with quadratic and cubic constraints on the 
rotation parameters, cf. Sec. 2.2. For solving this, there exist various 
approaches in the literature. Bergamasco et al. [15] use an iterative closest 
point algorithm that iteratively calculates the transformation from the 
observed points to the closest point on the corresponding rays, which, 
however, only converges near the optimum. Kanatani [101] suggests a 
fast method that first calculates a Euclidean solution by assuming 
$\mathbf{R} \in \mathbb{R}^{3 \times 3}$ and then projects the solution onto the SO(3)-manifold 
using the singular value decomposition, which results in a not entirely 
correct minimization. 

However, since the main focus of this work is not real-time optimiza- 
tion, but rather a highly precise pose estimation, there is the obligation to 
find an accurate minimum to ensure convergence of the AM calibration. 
Therefore, a gradient-based optimization approach on the Riemannian 
manifold SO(3) with tangent space so(3) is applied, cf. Sec. 2.2. The 
tangent space to the Lie group SO(3) is its Lie algebra so(3), which 
consists of all skew-symmetric $3 \times 3$ matrices. The mapping from any 
element $[\boldsymbol{\xi}]_\times \in so(3)$ to $\mathbf{R} \in SO(3)$ is called the exponential map 
$\mathbf{R} = \operatorname{Exp}([\boldsymbol{\xi}]_\times) = e^{[\boldsymbol{\xi}]_\times}$, and the reverse map is called the logarithmic map 
$[\boldsymbol{\xi}]_\times = \operatorname{Log}(\mathbf{R})$. Both can be calculated in closed form using the well-known 
Rodrigues rotation formulas (2.12), (2.13). Therefore, in a local 
neighborhood $g_{\mathbf{R}}(\boldsymbol{\xi}) = \operatorname{Exp}([\boldsymbol{\xi}]_\times)\, \mathbf{R}$ one can find a parametrization of 
the manifold in the tangent space. A function defined on the manifold 
can thus be described locally by Euclidean coordinates $\boldsymbol{\xi} \in \mathbb{R}^3$: 


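$$ f \circ g_{\mathbf{R}} \circ [\cdot]_\times : \mathbb{R}^3 \to so(3) \to SO(3) \to \mathbb{R}, \qquad f_{\boldsymbol{\xi}}(\mathbf{R}) := f(g_{\mathbf{R}}(\boldsymbol{\xi})) = f(\operatorname{Exp}([\boldsymbol{\xi}]_\times)\, \mathbf{R}) . \tag{5.26} $$


If a function is to be optimized on the manifold, the corresponding 
direction of descent must be sought in the local tangent space $f_{\boldsymbol{\xi}}(\mathbf{R})$. 
To use conventional optimization methods, a valid representation for 
both the gradient and the Hessian must be identified. According to Absil 
et al. [1], these can be easily found by using directional derivatives of the 
locally parameterized manifold in the direction of the tangent space: 


$$ D_{\boldsymbol{\xi}} f(\mathbf{R}) = \left. \frac{\mathrm{d}}{\mathrm{d}\varepsilon} f_{\varepsilon \boldsymbol{\xi}}(\mathbf{R}) \right|_{\varepsilon = 0} = \boldsymbol{\xi}^\top \operatorname{grad}(f), \tag{5.27} $$
$$ D_{\boldsymbol{\xi}}^2 f(\mathbf{R}) = \left. \frac{\mathrm{d}^2}{\mathrm{d}\varepsilon^2} f_{\varepsilon \boldsymbol{\xi}}(\mathbf{R}) \right|_{\varepsilon = 0} = \boldsymbol{\xi}^\top \operatorname{Hess}(f)\, \boldsymbol{\xi} . \tag{5.28} $$


Looking back at the original problem (5.25), this approach leads to the 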
explicit formulas for the Riemannian gradient and Riemannian Hessian 
(see Appx. 9.1.2 for a detailed derivation of the operators): 


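$$ \operatorname{grad}(f) = 2 \mathbf{Z}^\top (\mathbf{R} \otimes \mathbf{I}) (\mathbf{A} \mathbf{r} + \mathbf{b}), \tag{5.29} $$
$$ \operatorname{Hess}(f) = 2 \mathbf{Z}^\top \left( (\mathbf{R} \otimes \mathbf{I})\, \mathbf{A}\, (\mathbf{R} \otimes \mathbf{I})^\top - \mathbf{I} \otimes \operatorname{mat}(\mathbf{A} \mathbf{r} + \mathbf{b})\, \mathbf{R}^\top \right) \mathbf{Z}, \tag{5.30} $$


with $\mathbf{Z} = \left[ \operatorname{vec}([\mathbf{e}_1]_\times), \operatorname{vec}([\mathbf{e}_2]_\times), \operatorname{vec}([\mathbf{e}_3]_\times) \right] \in \mathbb{R}^{9 \times 3}$, the unit base 
vectors $\mathbf{e}_1, \mathbf{e}_2, \mathbf{e}_3$, and the identity matrix $\mathbf{I}$. The reshape operator $\operatorname{mat}(\cdot)$ 
is the inverse of the vectorization operator $\operatorname{vec}(\cdot)$, and $\otimes$ represents the 
Kronecker product. 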

After the formulas for the gradient and the Hessian have been established, 
a quadratic model in the local tangent space makes it possible to minimize 
the objective (5.25) with the help of an appropriate Newton descent 
algorithm. Apart from minor differences, the procedure is quite similar to 
the classic Euclidean approach [24]. For the current iteration, $\operatorname{grad} f(\mathbf{R}^{(n)})$ 
and $\operatorname{Hess} f(\mathbf{R}^{(n)})$ are calculated. After the search direction $\boldsymbol{\xi}^{(n)}$ has been 
found by solving the Newton equation, the step has to be mapped 
from the tangent space back to the manifold to obtain a valid descent step: 


$$ \operatorname{Hess} f(\mathbf{R}^{(n)})\, \boldsymbol{\xi}^{(n)} = -\operatorname{grad} f(\mathbf{R}^{(n)}), \tag{5.31} $$
$$ \mathbf{R}^{(n+1)} = \operatorname{Exp}\!\left( \alpha\, [\boldsymbol{\xi}^{(n)}]_\times \right) \mathbf{R}^{(n)} . \tag{5.32} $$


Finally, a subsequent 1D backtracking line search in SO(3) finds a 
sufficient step size $\alpha$ and accelerates the convergence [140]. Figure 5.6 
visualizes the procedure. In order to initialize the algorithm, an appropriate 
starting value is required; in the context of an AM camera calibration, the 
pose estimate from the previous iteration may be used. 
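
The mapping from the tangent space back to the manifold can be illustrated with a minimal NumPy sketch of the exponential map (Rodrigues formula) and the update step (5.32). The search direction is assumed to be given, e.g., from the Newton equation; the function names `exp_so3` and `retract` are hypothetical.

```python
import numpy as np


def skew(xi):
    """Element [xi]_x of so(3) for a vector xi in R^3."""
    return np.array([[0.0, -xi[2], xi[1]],
                     [xi[2], 0.0, -xi[0]],
                     [-xi[1], xi[0], 0.0]])


def exp_so3(xi):
    """Rodrigues formula: rotation matrix Exp([xi]_x) for a rotation vector xi."""
    theta = np.linalg.norm(xi)
    if theta < 1e-12:
        return np.eye(3) + skew(xi)  # first-order approximation near the identity
    K = skew(xi / theta)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)


def retract(R, xi, alpha=1.0):
    """Update step (5.32): move from R along the tangent direction xi."""
    return exp_so3(alpha * np.asarray(xi, dtype=float)) @ R


# A tangent step of 90 degrees about the z-axis, starting from the identity.
R0 = np.eye(3)
R1 = retract(R0, [0.0, 0.0, np.pi / 2], alpha=1.0)
print(np.round(R1, 6))
```

Whatever step size a line search selects, the result of `retract` remains a rotation matrix, which is the reason for searching in the tangent space rather than in the nine matrix entries.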

Looking back at the original camera pose optimization (5.23), we see 
that the pose has to be found in the special Euclidean group SE(3). 
Optimization on this manifold is not straightforward, but the problem 
can be simplified by making use of the local diffeomorphism between the 
manifolds SE(3) and $SO(3) \times \mathbb{R}^3$. If there is a (local) minimum in SE(3), 
then the same minimum exists in $SO(3) \times \mathbb{R}^3$ [189]. Having this in mind, 
the presented optimization performs two steps: first, optimization in 
SO(3) using the manifold Newton descent; and afterward, optimization 
in $\mathbb{R}^3$ using (5.24). Performing the optimization in this manner might 
be less efficient in terms of iterations, but it yields the same optimization 
result while avoiding the more complex calculation in the se(3) tangent 
space, which has a considerably different exponential and logarithmic map. 


Figure 5.6 Local parametrization of the SO(3)-manifold through its tangent space so(3). 
The search direction is found in the tangent space and projected back onto the manifold to 
find a minimum. 


5.2.6 Convergence, Acceleration and Initialization 


Given the current pose estimate, the camera ray calibration 
provides the globally optimal solution in every step. Furthermore, the 
pose estimation converges towards a minimum and provides a result that is 
no worse than that of the previous iteration. Following the research in the field of 
AM [65, 70], it is easy to show the convergence of the optimization 
procedure to a stationary point with an $O(1/n)$ convergence rate. To obtain a 
faster convergence, acceleration techniques may be applied. Therefore, 
Nesterov's acceleration scheme is modified to obtain an almost $O(1/n^2)$ 
convergence rate [62, 134]. The basic principle of this acceleration is that 
the difference between the new estimate and the old estimate is weighted 
and added to the new estimate in each iteration, where the weighting 
factor is a monotonically increasing sequence. However, these algorithms 
cannot be applied to the manifold optimization problems presented here 
without any adaptation. Hence, during the acceleration step, a weighted 
rate of the change of the pose parameters is added to the next estimate. 
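

Algorithm 2 Accelerated Alternating Minimization 


Input: For every pixel $i$ and target pose $k$: measured monitor coordinates 
$\mathbf{x}_{ik}$ and weights $w_{ik}$ 
Output: Calibrated ray $\mathbf{l}_i$ for each pixel and pose $\mathbf{R}_k$, $\mathbf{t}_k$ of all references 
Initialize: Set poses of reference targets $\mathcal{R}^{(0)}$, $\mathcal{T}^{(0)}$; set acceleration parameter $a_0 = 1$ 
1: for $n = 1, 2, 3, \dots$ do 
2:   for $i = 1, 2, 3, \dots$ do 
3:     Hold pose parameters and optimize rays 
4:     $\mathbf{l}_i^{(n)} = \operatorname*{arg\,min}_{\mathbf{l}_i \in \mathbb{P}^5} F(\mathcal{R}^{(n-1)}, \mathcal{T}^{(n-1)}, \mathcal{L})$ 
5:   end for 
6:   for $k = 1, 2, 3, \dots$ do 
7:     Hold ray parameters and optimize poses 
8:     $\mathbf{R}_k^{*}, \mathbf{t}_k^{*} = \operatorname*{arg\,min}_{(\mathbf{R}_k, \mathbf{t}_k) \in SE(3)} F(\mathcal{R}, \mathcal{T}, \mathcal{L}^{(n)})$ 
9:     Update acceleration rate 
10:    $a_n = \left( 1 + \sqrt{1 + 4 a_{n-1}^2} \right) / 2$ 
11:    Accelerate translation and rotation update 
12:    $\mathbf{t}_k^{(n)} = \mathbf{t}_k^{*} + \frac{a_{n-1} - 1}{a_n} \left( \mathbf{t}_k^{*} - \mathbf{t}_k^{(n-1)} \right)$ 
13:    $\mathbf{R}_k^{(n)} = \operatorname{Exp}\!\left( \frac{a_{n-1} - 1}{a_n} \operatorname{Log}\!\left( \mathbf{R}_k^{*} \mathbf{R}_k^{(n-1)\top} \right) \right) \mathbf{R}_k^{*}$ 
14:   end for 
15: end for 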


When accelerating the rotation, of course, this has to be done on the 
SO(3)-manifold: The current rotation is reversed by the previous rota- 
tion, projected onto the so(3) tangent space using the Log-map, weighted 
by an acceleration parameter, and finally transformed back into a rota- 
tion matrix using the Exp-map and multiplied onto the current estimate. 
Algorithm 2 summarizes the complete accelerated AM calibration. 
Although this is a strictly convergent algorithm, obviously no unique 
solution exists. Depending on the initialization, the optimization runs 
into an arbitrary coordinate system. Therefore, it is advisable to initialize 
the algorithm with a rough estimate of the reference target poses, 
which could, for example, be obtained using standard model-based approaches 
presented in the literature [26, 246] or the generic approach 
by Ramalingam et al. [165]. However, here it is of utmost importance 
that the camera model is properly chosen. Alternatively, of course, one 
can also randomly select starting poses, with the downside of a longer 
optimization time and an increased risk of converging to a non-optimal 
local minimum. Nonetheless, the arbitrary coordinate system poses no 
problem, since it does not change the geometric properties of the rays, 
and accordingly, the calibrated camera can be used without loss of ac- 
curacy. Even more, the final calibration can be easily transformed into a 
standardized coordinate system. 
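
The acceleration step for a single pose can be sketched as follows. This is an illustrative example under stated assumptions: the update $a_n = (1 + \sqrt{1 + 4 a_{n-1}^2})/2$ with the weight $(a_{n-1} - 1)/a_n$ is the common Nesterov-type choice and is assumed here, and all function names are hypothetical. The rotation is extrapolated in the so(3) tangent space as described above.

```python
import math
import numpy as np


def skew(xi):
    return np.array([[0.0, -xi[2], xi[1]],
                     [xi[2], 0.0, -xi[0]],
                     [-xi[1], xi[0], 0.0]])


def exp_so3(xi):
    """Rodrigues formula for Exp([xi]_x)."""
    theta = np.linalg.norm(xi)
    if theta < 1e-12:
        return np.eye(3) + skew(xi)
    K = skew(xi / theta)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)


def log_so3(R):
    """Rotation vector xi with Exp([xi]_x) = R, valid for rotation angles below pi."""
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    if theta < 1e-12:
        return 0.5 * w
    return theta / (2.0 * np.sin(theta)) * w


def next_rate(a_prev):
    """Assumed Nesterov-type sequence for the acceleration parameter."""
    return (1.0 + math.sqrt(1.0 + 4.0 * a_prev ** 2)) / 2.0


def accelerate_pose(R_new, t_new, R_old, t_old, a_prev):
    """Extrapolate a pose update; returns the accelerated pose and the new rate."""
    a_next = next_rate(a_prev)
    beta = (a_prev - 1.0) / a_next
    t_acc = t_new + beta * (t_new - t_old)
    # Relative rotation from the old to the new estimate, weighted in so(3).
    xi = log_so3(R_new @ R_old.T)
    R_acc = exp_so3(beta * xi) @ R_new
    return R_acc, t_acc, a_next


R_old = exp_so3(np.array([0.0, 0.0, 0.1]))
R_new = exp_so3(np.array([0.0, 0.0, 0.2]))
a1 = next_rate(1.0)
R_acc, t_acc, a2 = accelerate_pose(R_new, np.array([1.0, 0.0, 0.0]),
                                   R_old, np.zeros(3), a1)
print(a1, a2, t_acc)
```

With $a_0 = 1$ the first weight is zero, so the first iteration is not accelerated; afterwards the weight grows monotonically.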


5.2.7 Normalizing the Ray Bundle 


Due to the black box character of the generic calibration, it is initially 
not possible to define a consistent camera coordinate system for every 
calibrated camera. Even when using the same calibration algorithm for 
the same camera, the outcome can vary. Hence, the result of a generic 
calibration is in general not unique. That is, the calibrated camera rays are 
represented in an arbitrary coordinate system, which usually depends 
on the starting configuration of the generic calibration procedure or the 
used calibration reference target. Therefore, to transform this arbitrary 
coordinate system into one that is fixed to the individual camera, a few 
steps are necessary. 

First, the origin of the camera coordinate system is defined to be the 
optical center of the camera. For central cameras or nearly-central cameras, 
e.g., light field cameras, this corresponds approximately to the center of 
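the exit pupil. Its location can be understood as the point $\mathbf{p}_0$ that has the 
smallest distance to all rays, i.e., it can be calculated by minimizing the 
weighted mean of the squared Euclidean distances to all rays: 


$$ \mathbf{p}_0 = \operatorname*{arg\,min}_{\mathbf{p}} \sum_i w_i \left\| \mathbf{p} \times \mathbf{d}_i - \mathbf{m}_i \right\|^2 . \tag{5.33} $$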
P i 


The weighting factor $w_i$ can be chosen to suppress poorly calibrated rays 
and to remove outliers. For instance, a simple choice is to use the inverse 
of the mean ray re-projection error 


$$ E_i = \sum_k w_{ik} \left\| \mathbf{p}_{ik} \times \mathbf{d}_i - \mathbf{m}_i \right\|^2 , \tag{5.34} $$


which can be calculated during the generic calibration procedure for each 
ray. This results in 


$$ \mathbf{p}_0 = \Big( \sum_i w_i [\mathbf{d}_i]_\times [\mathbf{d}_i]_\times^\top \Big)^{-1} \sum_i w_i [\mathbf{d}_i]_\times \mathbf{m}_i . \tag{5.35} $$
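
The closed-form solution for the bundle center can be checked numerically with a small NumPy sketch. The function name and the synthetic ray bundle are hypothetical; the rays are constructed to pass through a known point, which the estimate should then recover.

```python
import numpy as np


def skew(v):
    """Cross-product matrix [v]_x with skew(v) @ a == np.cross(v, a)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])


def ray_bundle_center(d, m, w):
    """Point with minimal weighted squared distance to all rays, see (5.35)."""
    lhs = sum(w_i * skew(d_i) @ skew(d_i).T for d_i, w_i in zip(d, w))
    rhs = sum(w_i * skew(d_i) @ m_i for d_i, m_i, w_i in zip(d, m, w))
    return np.linalg.solve(lhs, rhs)


# Synthetic central bundle: all rays pass through the point q.
q = np.array([0.5, -1.0, 2.0])
d = np.array([[0.0, 0.0, 1.0], [0.0, 0.6, 0.8], [0.6, 0.0, 0.8]])
m = np.array([np.cross(q, d_i) for d_i in d])
p0 = ray_bundle_center(d, m, np.array([1.0, 2.0, 0.5]))
print(p0)
```

For a non-central camera, the same expression returns the point that is closest to all rays in the weighted least-squares sense.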


As a next step, the z-axis of the camera-fixed coordinate system defines 
the view axis as the average ray direction which can be found by solving 
the constrained optimization problem 


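$$\mathbf{d}_z = \arg\max_{\mathbf{d}} \sum_i w_i \,(\mathbf{d}^\top \mathbf{d}_i)^2 , \quad \text{subject to } \|\mathbf{d}\| = 1. \tag{5.36}$$


Using the Lagrange multiplier formalism and solving for $\mathbf{d}$ produces an
eigenvalue problem:


$$\mathbf{d}_z = \arg\max_{\mathbf{d}} \sum_i w_i \,(\mathbf{d}^\top \mathbf{d}_i)^2 - \mu\,(\mathbf{d}^\top \mathbf{d} - 1), \tag{5.37}$$

$$\Rightarrow \Big(\sum_i w_i \,\mathbf{d}_i \mathbf{d}_i^\top\Big)\, \mathbf{d}_z = \mu\, \mathbf{d}_z, \tag{5.38}$$


where the eigenvector $\mathbf{d}_z$ with the largest absolute eigenvalue $\mu$ results in the
average ray direction. A corresponding rotation matrix, which rotates
the bundle of rays such that $\mathbf{d}_z$ coincides with the z-axis $\mathbf{e}_z$, can
then directly be calculated using the Rodrigues formula (2.12) with the
normalized rotation axis:


$$\mathbf{R}_z = \operatorname{Exp}\!\left(\arccos(\mathbf{d}_z^\top \mathbf{e}_z)\, \frac{\mathbf{d}_z \times \mathbf{e}_z}{\|\mathbf{d}_z \times \mathbf{e}_z\|}\right). \tag{5.39}$$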
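As an illustration of (5.38) and (5.39), the following sketch computes the dominant eigenvector and the corresponding rotation. It is not the author's implementation: the names `view_axis` and `rotation_to_z` are hypothetical, the sign of the eigenvector is fixed by an assumed convention (it should point along the weighted mean direction), and the rotation is built so that `R @ dz` equals the z-axis.

```python
import numpy as np

def view_axis(d, w):
    """Dominant eigenvector of sum_i w_i d_i d_i^T, cf. (5.38)."""
    M = (d * w[:, None]).T @ d
    _, vecs = np.linalg.eigh(M)        # eigenvalues in ascending order
    dz = vecs[:, -1]
    # Assumption: eigenvectors have an arbitrary sign, so orient the axis
    # along the weighted mean ray direction.
    if dz @ (w @ d) < 0:
        dz = -dz
    return dz

def rotation_to_z(dz):
    """Rodrigues rotation R with R @ dz == e_z, cf. (5.39)."""
    ez = np.array([0.0, 0.0, 1.0])
    axis = np.cross(dz, ez)
    s, c = np.linalg.norm(axis), dz @ ez
    if s < 1e-12:                      # dz is already (anti-)parallel to e_z
        return np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    k = axis / s
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    theta = np.arctan2(s, c)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * K @ K
```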


The last remaining degree of freedom is the rotation around this new 
z-axis. Since the cameras that are studied in this work (standard cameras 
and light field cameras) project the light onto a rectangular sensor, it is 
useful to align the coordinate system’s x- and y-axis with the correspond- 
ing sensor’s s- and t-axis, respectively. Furthermore, due to the almost 
perspective projection, the change of ray direction with respect to the x- 
and y-axis should correspond to the change with respect to the s- and 
t-axis. Thus, using $\mathbf{d}_i = (d_{x,i}, d_{y,i}, d_{z,i})^\top$, the rotation angle that aligns
both coordinate systems can be found by calculating the mean image
gradients with respect to $\mathbf{u} = (s,t)^\top$:


$$\begin{pmatrix} d_{x,s} \\ d_{x,t} \end{pmatrix} = \frac{\sum_i w_i \,\nabla_{\mathbf{u}} d_{x,i}}{\sum_i w_i}, \qquad \begin{pmatrix} d_{y,s} \\ d_{y,t} \end{pmatrix} = \frac{\sum_i w_i \,\nabla_{\mathbf{u}} d_{y,i}}{\sum_i w_i}. \tag{5.40}$$

By estimating the orientation angle of the gradients with respect to the
sensor axes, a rotation matrix can be found that rotates the coordinate
system around the z-axis by an angle $\alpha$:


$$\alpha_x = \operatorname{arctan2}(d_{x,s}, d_{x,t}), \qquad \alpha_y = \operatorname{arctan2}(d_{y,s}, d_{y,t}) + \frac{\pi}{2}, \tag{5.41}$$
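$$\alpha = \operatorname{arctan2}(\sin\alpha_x + \sin\alpha_y,\; \cos\alpha_x + \cos\alpha_y), \tag{5.42}$$

$$\mathbf{R}_\alpha = \begin{pmatrix} \cos\alpha & -\sin\alpha & 0 \\ \sin\alpha & \cos\alpha & 0 \\ 0 & 0 & 1 \end{pmatrix}. \tag{5.43}$$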
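Equation (5.42) is a circular mean, which avoids the wrap-around problem of averaging angles arithmetically. A minimal sketch, with the hypothetical helper names `mean_angle` and `rotation_about_z`:

```python
import numpy as np

def mean_angle(alpha_x, alpha_y):
    """Circular mean of two angles, cf. (5.42)."""
    return np.arctan2(np.sin(alpha_x) + np.sin(alpha_y),
                      np.cos(alpha_x) + np.cos(alpha_y))

def rotation_about_z(alpha):
    """Rotation matrix R_alpha around the z-axis, cf. (5.43)."""
    c, s = np.cos(alpha), np.sin(alpha)
    return np.array([[c, -s, 0.0],
                     [s, c, 0.0],
                     [0.0, 0.0, 1.0]])
```

For two angles close to $\pm\pi$, e.g. 3.1 and -3.1, the arithmetic mean would be 0, whereas the circular mean correctly yields an angle of magnitude $\pi$.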


While this gradient-based approach works well for camera systems whose 
ray surface is a smooth function, problems arise with discontinuities. For 
light field cameras, the ray direction switches to the opposite direction 
at the edges of the microlenses. As a result, the gradient shows a strong 
tendency to the opposite direction, which would lead to a corrupted 
orientation estimation. However, these excessively strong gradients can easily
be suppressed by means of a threshold value in (5.40), with $w_i = 0$ for
$\|\nabla_{\mathbf{u}} d_{\cdot,i}\| > T_{\mathrm{th}}$. In addition, the weight factor $w_i$ is very small near
the microlens edges, due to the higher calibration error $\varepsilon_i$ that is caused
by the overall worse quality of the optics. Hence, these values are
strongly suppressed anyway.

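After all normalization parameters are found, the final step is to shift
the origin and rotate the axes appropriately, which transforms the Plücker ray
parameters into the camera-fixed coordinate system. Thus, each ray
$\mathbf{l}_i = (\mathbf{d}_i^\top, \mathbf{m}_i^\top)^\top$ is transformed into the new normalized representation:


$$\mathbf{l}_{i,\mathrm{norm}} = \mathbf{T}\, \mathbf{l}_i, \tag{5.44}$$


with the ray transformation matrix $\mathbf{T}$ that consists of a ray rotation
matrix (2.17) and a ray translation matrix (2.18):


$$\mathbf{T} = \begin{pmatrix} \mathbf{R}_\alpha \mathbf{R}_z & \mathbf{0} \\ -\mathbf{R}_\alpha \mathbf{R}_z [\mathbf{p}_o]_\times & \mathbf{R}_\alpha \mathbf{R}_z \end{pmatrix}. \tag{5.45}$$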
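The effect of the ray transformation can be checked numerically. The sketch below assumes the Plücker convention $\mathbf{m} = \mathbf{p} \times \mathbf{d}$ and the point transformation $\mathbf{x}' = \mathbf{R}(\mathbf{x} - \mathbf{p}_o)$ with $\mathbf{R} = \mathbf{R}_\alpha \mathbf{R}_z$; the function name `ray_transform` is hypothetical.

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]_x, so that skew(v) @ p equals np.cross(v, p)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def ray_transform(R, p_o):
    """6x6 matrix T acting on l = (d, m) for the point map x' = R (x - p_o)."""
    T = np.zeros((6, 6))
    T[:3, :3] = R                    # directions are only rotated
    T[3:, :3] = -R @ skew(p_o)       # shift of the origin changes the moment
    T[3:, 3:] = R
    return T
```

A ray that passes through $\mathbf{p}_o$ obtains a zero moment after the transformation, i.e., it passes through the new origin.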


5.3 Calibration of the Reference Target 


Besides the camera, the reference target also plays an important role in
camera calibration and deflectometry. The commonly used checkerboard
calibration patterns are often printed on paper and then glued to a solid
base of wood or cardboard. However, due to this rudimentary construction,
the reference target can no longer be assumed to be absolutely flat.
This means that the solid base material might be bent and, in addition, 
small bumps on the paper can locally affect the planarity. The use of 
monitor screens as reference targets drastically reduces this problem, 
because the pixel plane has a very high local planarity due to the precise 
manufacturing process. 

As already mentioned, the calibration method presented in this work 
requires dense features on a reference target, which is why it is recom- 
mended to use a monitor as a reference. Nevertheless, monitor screens 
are not ideal reference targets either. Depending on how they are set up, 
they can deviate from their ideally planar shape to a greater or lesser 
extent. Therefore, if this deviation is not sufficiently taken into account, 
it can lead to a non-ideal calibration. Also, apart from the calibration 
aspect, if the monitor is placed in a deflectometric measurement setup 
and, for example, is mounted over the measurement sample, it may 
show considerable curvature. To prevent this from leading to erroneous 
measurements, it is therefore imperative that the calibration target is 
described by appropriate modeling. The modeling of the monitor in 
this work can be grouped into three sub-aspects, i.e., the modeling of 
the nonlinear characteristic of the pixel brightness, the modeling of the 
refraction at the front glass, and the modeling of the screen shape. As 
briefly mentioned in Sec. 4.1.1, while the brightness characteristic only 
influences the quality of the registration and can easily be compensated, 
the two remaining aspects systematically and directly influence the value 
of the measured coordinates. The coding methods from Ch. 4 explain 
how a subpixel position in the monitor plane can be assigned to each 
camera ray employing active illumination. However, only the x- and
y-coordinate of the reference point can be determined. So far, it was not
specified how the z-coordinate, i.e., the height, can be obtained, or it was
implicitly assumed that it is set to zero for a flat monitor.

The following sections deal with the modeling of the reference target, 
the estimation of the model parameters, and finally the integration of 
the reference model into the camera calibration framework.


5.3.1 Reference Surface Model


There are several ways to model the non-ideality of the monitor. Bergamasco
et al. [15] extend their camera calibration algorithm by modeling
the influence of refraction at the front glass. They adjust the local monitor
coordinates $x, y$ using an additive offset that is calculated using
the angle between the ray direction and monitor surface normal. The
parameters of the refraction model are predefined and used to improve
the camera calibration. Schmalz et al. [182] and Chen et al. [36], on the
other hand, take refraction into account by correcting the z-component
of a measured point, whereas the x, y-coordinates remain unchanged.
Maestro-Watson et al. [127] model the refraction in a similar way; however,
they confirm that the monitor surface also deforms the cover glass.
Hence, they measure the surface using a coordinate measuring machine
to obtain better surface normals for the refraction calculation. Studies
by Schmalz et al. [182] and Bergamasco et al. [15] show that modeling
the refraction has only a small impact on applications such as camera
calibration or deflectometry. As investigated by Nüss et al. [141], the
shape of the monitor has a far greater influence. Bartsch et al. [13] model
the monitor by representing its surface with a polynomial surface, and
the model parameters are estimated during the calibration of a deflectometric
measurement system. To combine both non-idealities, Reh et al.
[168] model the z-coordinate of the monitor as an additive superposition
of both effects, that is, shape and refraction.


5.3.1.1 Shape Model 


Commercially available monitor screens are locally very planar and only 
deviate globally from the ideal plane, which can be perceived as a slight 
curvature or torsion. Thus, as suggested by Reh et al. [168] and Bartsch 
et al. [13], the z-coordinate of the reference points, i.e., the monitor height, 
is defined using a bivariate polynomial function 


$$z_S(x,y) = \sum_{m=0}^{N_x} \sum_{n=0}^{N_y} c_{mn}\, x^m y^n , \tag{5.46}$$


where $N_x$ and $N_y$ are the highest orders of the variables $x$ and $y$, respectively,
and the constants $c_{mn}$ represent the coefficients of the corresponding
polynomial components. As shown by Varsamis et al. [210], to get
a short expression of the bivariate polynomial function and to simplify
further calculations, (5.46) is converted into a vector representation:


$$\mathbf{m}(x,y) = \big[1, x, \ldots, x^{N_x}, y, xy, \ldots, x^{N_x} y^{N_y}\big]^\top, \tag{5.47}$$

$$\mathbf{c} = \big[c_{00}, c_{10}, \ldots, c_{N_x 0}, c_{01}, c_{11}, \ldots, c_{N_x N_y}\big]^\top, \tag{5.48}$$

$$\Rightarrow\; z_S(x,y) = \mathbf{m}(x,y)^\top \mathbf{c}. \tag{5.49}$$


Figure 5.7 Refraction of rays at the cover glass as proposed by Reh et al. [168].


5.3.1.2 Refraction Model 


For the modeling of the refraction at the front glass cover, the model of
Reh et al. [168] and Chen et al. [36] shall serve as a reference. The refraction
in the cover glass causes the measured monitor coordinates to appear in
a slightly closer position, which depends on the angle of incidence of the
camera rays. From figure 5.7 follows $h \tan(\beta) = g \tan(\alpha)$, and by using
the law of refraction $\sin(\alpha) = n \sin(\beta)$, where $n$ is the refraction index
of the glass, it follows


$$z_R = h - g = h\left(1 - \frac{\tan(\beta)}{\tan(\alpha)}\right) = h\left(1 - \frac{\cos(\alpha)}{n\sqrt{1 - \sin^2(\beta)}}\right) = h\left(1 - \frac{\cos(\alpha)}{\sqrt{n^2 - 1 + \cos^2(\alpha)}}\right). \tag{5.50}$$

To calculate the refraction, the angle $\alpha$ between a camera ray and the
surface normal at the observed monitor point must be determined. For
the unit normal vector $\hat{\mathbf{n}}$ and a ray with direction vector $\mathbf{d}$ follows


$$\cos(\alpha) = \mathbf{d}^\top \mathbf{R}\, \hat{\mathbf{n}}, \tag{5.51}$$


where R transforms the monitor coordinate system into the camera coor- 
dinate system. While Reh et al. [168], for simplicity, consider the refraction 
model to be completely independent of the shape model, because they 
use only a very small screen with a diagonal of about 2 cm length, this 
simplification does not hold in this work. Since commercially available 
monitors usually have a diagonal of more than 50 cm length, a deforma- 
tion of the monitor also causes a deformation of the front glass. Therefore, 
according to Maestro-Watson et al. [127], the normal of the front glass 
should be calculated using the shape model. It follows: 


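$$\mathbf{n}(x,y,\mathbf{c}) = \begin{pmatrix} -\partial_x z_S \\ -\partial_y z_S \\ 1 \end{pmatrix} = \begin{pmatrix} -\sum_{m=0}^{N_x} \sum_{n=0}^{N_y} m\, c_{mn}\, x^{m-1} y^{n} \\ -\sum_{m=0}^{N_x} \sum_{n=0}^{N_y} n\, c_{mn}\, x^{m} y^{n-1} \\ 1 \end{pmatrix}. \tag{5.52}$$


With the normalized vector $\hat{\mathbf{n}} = \mathbf{n} / \|\mathbf{n}\|$, this leads to the expression
for the height deviation caused by the refraction in the front glass


$$z_R(x,y) = h\left(1 - \frac{\mathbf{d}^\top \mathbf{R}\, \hat{\mathbf{n}}(x,y,\mathbf{c})}{\sqrt{n^2 - 1 + \big(\mathbf{d}^\top \mathbf{R}\, \hat{\mathbf{n}}(x,y,\mathbf{c})\big)^2}}\right). \tag{5.53}$$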


5.3.1.3 Complete Reference Model 


As suggested by Reh et al. [168], to obtain the reference model, both the
refraction model and the shape model are combined. Given the direction
of a camera ray $\mathbf{d}_i$, the point coordinates $x_{ik}, y_{ik}$ that were estimated
using phase-shift coding, and the rotation of the reference target $\mathbf{R}_k$, the
value of the z-coordinate can be calculated. Finally, the complete monitor
model is represented by $z_C(x_{ik}, y_{ik}) = z_S(x_{ik}, y_{ik}) - z_R(x_{ik}, y_{ik})$, where
the z-value of the refraction is subtracted, since the refraction causes the
measured monitor coordinates to appear in a slightly closer position.
Using the abbreviations $\mathbf{m}_{ik} := \mathbf{m}(x_{ik}, y_{ik})$ and $\hat{\mathbf{n}}_{ik}(\mathbf{c}) := \mathbf{R}_k \hat{\mathbf{n}}(x_{ik}, y_{ik}, \mathbf{c})$,
this results in


$$\mathbf{x}_{ik} = \begin{pmatrix} x_{ik} \\ y_{ik} \\ z_C(x_{ik}, y_{ik}) \end{pmatrix} = \begin{pmatrix} x_{ik} \\ y_{ik} \\ \mathbf{m}_{ik}^\top \mathbf{c} - h\left(1 - \dfrac{\mathbf{d}_i^\top \hat{\mathbf{n}}_{ik}(\mathbf{c})}{\sqrt{n^2 - 1 + \big(\mathbf{d}_i^\top \hat{\mathbf{n}}_{ik}(\mathbf{c})\big)^2}}\right) \end{pmatrix}. \tag{5.54}$$
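The complete reference model can be sketched in a few lines of NumPy. The function names are hypothetical, the coefficient vector is assumed to be ordered as in (5.48) with the powers of $x$ varying fastest, and the absolute value of the cosine is taken as an additional assumption so that the sketch does not depend on whether the ray direction points towards or away from the monitor.

```python
import numpy as np

def monomials(x, y, n_x, n_y):
    """Vector m(x, y) of (5.47); the powers of x vary fastest."""
    return np.array([x**m * y**n
                     for n in range(n_y + 1) for m in range(n_x + 1)])

def shape_height(x, y, c, n_x, n_y):
    """Shape model z_S = m(x, y)^T c, cf. (5.49)."""
    return monomials(x, y, n_x, n_y) @ c

def surface_normal(x, y, c, n_x, n_y):
    """Unit normal of the polynomial surface, cf. (5.52)."""
    C = np.asarray(c, dtype=float).reshape(n_y + 1, n_x + 1)  # C[n, m] = c_mn
    dz_dx = sum(m * C[n, m] * x**(m - 1) * y**n
                for n in range(n_y + 1) for m in range(1, n_x + 1))
    dz_dy = sum(n * C[n, m] * x**m * y**(n - 1)
                for n in range(1, n_y + 1) for m in range(n_x + 1))
    v = np.array([-dz_dx, -dz_dy, 1.0])
    return v / np.linalg.norm(v)

def complete_height(x, y, c, n_x, n_y, d, R, h, n_glass):
    """z_C = z_S - z_R for one ray direction d and target rotation R, cf. (5.54)."""
    # Assumption: abs() makes the result independent of the sign convention of d.
    cos_a = abs(d @ R @ surface_normal(x, y, c, n_x, n_y))
    z_r = h * (1.0 - cos_a / np.sqrt(n_glass**2 - 1.0 + cos_a**2))
    return shape_height(x, y, c, n_x, n_y) - z_r
```

For a flat monitor ($\mathbf{c} = \mathbf{0}$) observed frontally, the model reduces to $z_C = -h(1 - 1/n)$, i.e., the well-known apparent depth of a point behind a glass plate.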


5.3.2 Parameter Estimation 


In order to estimate the parameters $\mathbf{c}$, $h$ and $n$ of the reference model,
the newly modeled z-coordinate has to be integrated into the objective
function (5.9):


$$f(\mathcal{R}, \mathcal{T}, \mathcal{L}, \mathbf{c}, h, n) = \sum_{i,k} w_{ik} \left\| \left( \mathbf{R}_k \begin{pmatrix} x_{ik} \\ y_{ik} \\ z_C(x_{ik}, y_{ik}) \end{pmatrix} + \mathbf{t}_k \right) \times \mathbf{d}_i - \mathbf{m}_i \right\|^2 . \tag{5.55}$$

Given that the modeling of the front glass results in a strongly nonlinear
equation, (5.55) cannot be simplified as demonstrated in the previous sections.
Nonetheless, because the monitor model consists of relatively few
parameters, it can be optimized using standard gradient-based
methods (Levenberg-Marquardt, BFGS, etc. [140]). To ensure the stability
of the optimization, the front glass parameters must have constraints to
avoid physically unreasonable solutions. The optimal monitor parameters
can then be found by solving the following bound-constrained
optimization problem:


$$\arg\min_{\mathbf{c}, h, n} f(\mathcal{R}, \mathcal{T}, \mathcal{L}, \mathbf{c}, h, n), \quad \text{subject to } 1 < n,\; 0 < h. \tag{5.56}$$

Since a gradient-based optimization is an iterative process, the
objective function must be evaluated at least once in each iteration.
Consequently, the sum over all rays $i$ and all poses $k$ has to be
recalculated very often, which may take several seconds even with an
efficient implementation on current GPU hardware. If the monitor optimization
is now to be integrated into the generic calibration, the total
optimization time will increase dramatically. However, investigations by
Schmalz et al. [182] showed that the front glass only has a very small
influence on the calibration. It is therefore advisable to estimate only the
shape of the monitor and to rely on the manual of the used monitor to
obtain the parameters $h, n$ of the cover glass.

If the optimization of the glass cover is omitted, the objective func- 
tion can be rearranged in a way that the summation over all poses and 
rays only needs to be evaluated once during the optimization. This 
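results in a very fast optimization. With the help of the abbreviation
$\mathbf{a}_{ik} = [\mathbf{d}_i]_\times^\top (x_{ik} \mathbf{r}_{1k} + y_{ik} \mathbf{r}_{2k} + \mathbf{t}_k) - \mathbf{m}_i$, by using the column vectors of
the rotation matrices $\mathbf{R}_k = [\mathbf{r}_{1k}, \mathbf{r}_{2k}, \mathbf{r}_{3k}]$, and by assuming $z_{R,ik} = 0$,
the optimization problem (5.55) can be expressed as


$$\begin{aligned} f(\mathcal{R}, \mathcal{T}, \mathcal{L}, \mathbf{c}) &= \sum_{i,k} w_{ik} \left\| [\mathbf{d}_i]_\times^\top \big(x_{ik} \mathbf{r}_{1k} + y_{ik} \mathbf{r}_{2k} + \mathbf{m}_{ik}^\top \mathbf{c}\, \mathbf{r}_{3k} + \mathbf{t}_k\big) - \mathbf{m}_i \right\|^2 \\ &= \sum_{i,k} w_{ik} \left\| [\mathbf{d}_i]_\times^\top \mathbf{r}_{3k} \mathbf{m}_{ik}^\top \mathbf{c} + \mathbf{a}_{ik} \right\|^2 \\ &= \mathbf{c}^\top \mathbf{Q} \mathbf{c} + \mathbf{q}^\top \mathbf{c} + o. \end{aligned} \tag{5.57}$$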


An easy-to-find minimum of the above objective function can be obtained
assuming that the matrix $\mathbf{Q}$ is positive definite. If so, the optimal
parameter vector of the reference model can be straightforwardly inferred
without using further optimization steps:


$$\mathbf{c} = -\tfrac{1}{2}\, \mathbf{Q}^{-1} \mathbf{q}. \tag{5.58}$$


Since the matrix $\mathbf{Q} = \sum_{i,k} w_{ik} \mathbf{H}_{ik}^\top \mathbf{H}_{ik}$ consists of the sum of squares of $\mathbf{H}_{ik}$, it is positive
semidefinite. And due to the objective function being quadratic, a global
minimum is obtained. The degenerate case with $\det(\mathbf{Q}) = 0$ occurs in
reality only if $\mathbf{H}_{ik} = [\mathbf{d}_i]_\times^\top \mathbf{r}_{3k} \mathbf{m}_{ik}^\top = \mathbf{0}$ holds for all summands. This
means that all camera rays $\mathbf{d}_i$ would have to be parallel to the z-axis
$\mathbf{r}_{3k}$ of all the reference coordinate systems, i.e., a telecentric camera would
always need to look exactly frontally at the monitor. Because this case
can only be achieved for very special imaging configurations, it will not
be considered further in this work.
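The closed-form estimate (5.58) can be sketched as follows. The function name `fit_shape` and the stacked array layout are assumptions made for this example; the sketch exploits that $\mathbf{H}_{ik} = (\mathbf{r}_{3k} \times \mathbf{d}_i)\, \mathbf{m}_{ik}^\top$ has rank one, so that $\mathbf{Q}$ and $\mathbf{q}$ can be accumulated without forming each $\mathbf{H}_{ik}$ explicitly.

```python
import numpy as np

def fit_shape(d, r3, m_vec, a, w):
    """Minimise sum_j w_j ||H_j c + a_j||^2 with H_j = (r3_j x d_j) m_j^T.

    All inputs are stacked over the ray/pose pairs j = (i, k):
    d, r3, a: (J, 3); m_vec: (J, P) monomial vectors; w: (J,) weights.
    """
    g = np.cross(r3, d)                               # [d]_x^T r3 = r3 x d
    gg = np.einsum('ji,ji->j', g, g)                  # ||g_j||^2
    ga = np.einsum('ji,ji->j', g, a)                  # g_j^T a_j
    Q = (m_vec * (w * gg)[:, None]).T @ m_vec         # sum_j w_j H_j^T H_j
    q = 2.0 * (m_vec * (w * ga)[:, None]).sum(axis=0) # 2 sum_j w_j H_j^T a_j
    return -0.5 * np.linalg.solve(Q, q)
```

Because the sums have to be accumulated only once, the cost of this step is dominated by a single pass over all ray/pose pairs and one small linear solve.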

As the estimation of the reference model parameters not only returns 
the shape of the monitor but also helps to improve the camera calibra- 
tion, it can be easily integrated into the overall optimization framework. 
Thus, it is only necessary to calculate the value of the z-coordinate of 
the reference points using the current reference model (5.54). This is 
then used in each step of the ray estimation from Sec. 5.2.4 and the pose 
estimation from Sec. 5.2.5. The alternating minimization for the generic 
camera calibration can then be extended to a three-step optimization: 


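$$\mathbf{l}_i^{(n+1)} = \arg\min_{\mathbf{l}_i} f\big(\mathcal{R}^{(n)}, \mathcal{T}^{(n)}, \mathbf{l}_i, \mathbf{c}^{(n)}\big) \quad \forall i \in \mathcal{I}, \tag{5.59}$$

$$\mathbf{c}^{(n+1)} = \arg\min_{\mathbf{c} \in \mathbb{R}^{(N_x+1)(N_y+1)}} f\big(\mathcal{R}^{(n)}, \mathcal{T}^{(n)}, \mathcal{L}^{(n+1)}, \mathbf{c}\big), \tag{5.60}$$

$$\big(\mathbf{R}_k^{(n+1)}, \mathbf{t}_k^{(n+1)}\big) = \arg\min_{(\mathbf{R}_k, \mathbf{t}_k) \in SE(3)} f\big(\mathbf{R}_k, \mathbf{t}_k, \mathcal{L}^{(n+1)}, \mathbf{c}^{(n+1)}\big) \quad \forall k \in \mathcal{K}, \tag{5.61}$$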


where the reference target model can be initialized as a flat screen using 
c = 0. Of course, in order to obtain the complete reference model, the 
influence of the cover glass on the measured reference coordinates and 
its parametrization may be included in the overall calibration. 


5.4 Calibration of the Deflectometry Setup 


While the camera calibration provides a determination of the vision rays 
and the calibration of the reference target allows modeling of the refer- 
ence features, for the deflectometric reconstruction of specular surfaces 
another calibration is necessary: The transformation between camera and 
monitor coordinates has to be identified to transform the local monitor 
features into the global camera coordinate system. Here, the assumption 
is made that the camera and the reference monitor do not move relative 
to each other so that there is only one transformation. A problem that 
arises here is that in the deflectometric measurement setup the monitor 
is generally not in the direct field of view of the camera, since it should 
only be observed as a reflection on the surface under test. A monitor pose
estimation, as presented in Sec. 5.2.5, does not work here without modification.
To estimate the real relative transformation between camera and
monitor, one can observe the monitor via the reflection in a reference 
mirror. If the position and shape of this mirror are known, the virtual 
(mirrored) monitor pose can be used to determine the true transforma- 
tion between camera and monitor. In general, however, the position of 
the mirror is unknown. There are various approaches to solving this dif- 
ficulty. As probably the most intuitive approach, markers can be placed 
on the mirror, which allows a direct pose estimation of the mirror [6, 29]. 
If no markers can be placed on the mirror or if it is not desired that the 
markers increase the measurement uncertainty, then the mirror pose and 
the monitor pose can also be calculated indirectly. For this purpose, the 
mirror is not only placed in one position but in several positions, and the 
virtual monitor pose is measured each time. The set of virtual poses can 
then be used to infer the original pose [196, 229, 231]. 

In this work, the monitor pose is found using a marker-less plane 
mirror. The following sections explain how the set of virtual poses can 
be used to obtain a linear solution for the pose. Then, it is described how 
the generic pose estimation from the previous sections can be used to 
further improve the linear solution. 


5.4.1 Linear Solution 


The problem of the deflectometric calibration is shown in figure 5.8.
Because the camera does not see the monitor directly but only its reflection,
the monitor coordinates $\mathbf{x}$ are first transformed into the camera
coordinate system and then reflected at the mirror plane. The virtual
coordinates $\tilde{\mathbf{x}}$ can then be calculated with


$$\begin{pmatrix} \tilde{\mathbf{x}} \\ 1 \end{pmatrix} = \begin{pmatrix} \mathbf{H} & 2d\mathbf{n} \\ \mathbf{0}^\top & 1 \end{pmatrix} \begin{pmatrix} \mathbf{R} & \mathbf{t} \\ \mathbf{0}^\top & 1 \end{pmatrix} \begin{pmatrix} \mathbf{x} \\ 1 \end{pmatrix} = \begin{pmatrix} \tilde{\mathbf{R}} & \tilde{\mathbf{t}} \\ \mathbf{0}^\top & 1 \end{pmatrix} \begin{pmatrix} \mathbf{x} \\ 1 \end{pmatrix}, \tag{5.62}$$

where $\mathbf{n}$ is the unit normal of the mirror, $d$ is the shortest distance between
the mirror plane and the camera aperture, and $\mathbf{H} = \mathbf{I} - 2\mathbf{n}\mathbf{n}^\top$ represents
a reflection operator. Depending on the mirror position, the relation
between the different virtual poses of the reflected monitor and the pose
of the true monitor can be directly derived:


$$\tilde{\mathbf{R}}_i = \mathbf{H}_i \mathbf{R}, \tag{5.63}$$

$$\tilde{\mathbf{t}}_i = \mathbf{H}_i \mathbf{t} + 2 d_i \mathbf{n}_i. \tag{5.64}$$


Figure 5.8 Mirror-based pose estimation: The camera sees the reflection of the monitor
in the reference mirror. Only the transformation of virtual monitor coordinates to camera
coordinates can be estimated.
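The relations (5.63) and (5.64) can be verified with a short sketch that reflects a transformed monitor point at the mirror plane $\mathbf{n}^\top \mathbf{x} = d$. The function names `householder` and `virtual_pose` are hypothetical.

```python
import numpy as np

def householder(n):
    """Reflection operator H = I - 2 n n^T for a unit normal n."""
    return np.eye(3) - 2.0 * np.outer(n, n)

def virtual_pose(R, t, n, d):
    """Virtual (mirrored) pose for the mirror plane n^T x = d, cf. (5.63), (5.64)."""
    H = householder(n)
    return H @ R, H @ t + 2.0 * d * n
```

The virtual rotation is an improper rotation with determinant $-1$, which is why the pose estimation has to be adapted for mirrored observations.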

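To obtain a solvable equation system, the angle of the mirror must be
changed for each acquisition. Takahashi et al. [196] show that the equations
can be solved using an orthogonality constraint if at least three
mirror positions are observed. For this, the intersection line $\mathbf{m}_{ij}$ between
all possible mirror pairs $i, j \in \{1, 2, 3\}$ is defined. Since the intersecting
lines have to be orthogonal to the respective mirror normals, Xiao et al.
[229] derive the equation


$$\tilde{\mathbf{R}}_i \tilde{\mathbf{R}}_j^\top \mathbf{m}_{ij} = (\mathbf{H}_i \mathbf{R}) (\mathbf{H}_j \mathbf{R})^\top \mathbf{m}_{ij} = \mathbf{H}_i \mathbf{R} \mathbf{R}^\top \mathbf{H}_j^\top \mathbf{m}_{ij} = (\mathbf{I} - 2\mathbf{n}_i \mathbf{n}_i^\top)(\mathbf{I} - 2\mathbf{n}_j \mathbf{n}_j^\top)\, \mathbf{m}_{ij} = \mathbf{m}_{ij}. \tag{5.65}$$

The intersection line $\mathbf{m}_{ij}$ can be found as the unit eigenvector with the smallest
absolute eigenvalue of the matrix $\tilde{\mathbf{R}}_i \tilde{\mathbf{R}}_j^\top - \mathbf{I}$. By using the intersection lines,
the unit normal vectors of the reference mirror planes can be calculated:


$$\mathbf{n}_1 = \frac{\mathbf{m}_{12} \times \mathbf{m}_{13}}{\|\mathbf{m}_{12} \times \mathbf{m}_{13}\|}, \quad \mathbf{n}_2 = \frac{\mathbf{m}_{12} \times \mathbf{m}_{23}}{\|\mathbf{m}_{12} \times \mathbf{m}_{23}\|}, \quad \mathbf{n}_3 = \frac{\mathbf{m}_{13} \times \mathbf{m}_{23}}{\|\mathbf{m}_{13} \times \mathbf{m}_{23}\|}. \tag{5.66}$$


For more than three poses, the normal estimate can also be averaged [197]:

$$\mathbf{M}_i^\top \mathbf{n}_i = \mathbf{0} \quad \text{with} \quad \mathbf{M}_i = (\mathbf{m}_{i1}, \mathbf{m}_{i2}, \mathbf{m}_{i3}, \ldots), \tag{5.67}$$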


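where the normal vector is found to be the eigenvector with the smallest
eigenvalue of the matrix $\mathbf{M}_i \mathbf{M}_i^\top$. Then, using (5.63) and $\mathbf{H}_i \mathbf{H}_i = \mathbf{I}$, a
rotation matrix can be calculated for each mirror pose, $\mathbf{R}_i = \mathbf{H}_i \tilde{\mathbf{R}}_i$. In the
ideal case, all estimates should give the same result. However, in order to
suppress noise, rotation averaging [73] is applied and the mean rotation
matrix is calculated using a singular value decomposition:


$$\bar{\mathbf{R}} = \sum_i \mathbf{R}_i \;\rightarrow\; \bar{\mathbf{R}} = \mathbf{U} \mathbf{S} \mathbf{V}^\top \;\rightarrow\; \mathbf{R} = \mathbf{U} \mathbf{V}^\top. \tag{5.68}$$


Finally, by using (5.64), the remaining translation vector and mirror distances
can easily be found by solving a system of linear equations


$$\begin{pmatrix} \mathbf{H}_1 & 2\mathbf{n}_1 & \mathbf{0} & \mathbf{0} \\ \mathbf{H}_2 & \mathbf{0} & 2\mathbf{n}_2 & \mathbf{0} \\ \mathbf{H}_3 & \mathbf{0} & \mathbf{0} & 2\mathbf{n}_3 \end{pmatrix} \begin{pmatrix} \mathbf{t} \\ d_1 \\ d_2 \\ d_3 \end{pmatrix} = \begin{pmatrix} \tilde{\mathbf{t}}_1 \\ \tilde{\mathbf{t}}_2 \\ \tilde{\mathbf{t}}_3 \end{pmatrix}. \tag{5.69}$$


Thus, given virtual pose parameters $\tilde{\mathbf{R}}_i, \tilde{\mathbf{t}}_i$, the true pose of the monitor
$\mathbf{R}, \mathbf{t}$ can be obtained with a closed-form solution.

To find these virtual pose parameters, in principle, standard PnP methods
can be used [109, 133]. However, these usually work with a classical
camera model, e.g., the model described in Sec. 5.1. Because this work
does not commit to a specific camera model, the generic pose estimation
from Sec. 5.2.5 will be used. In the context of a generic camera model,
for each camera ray, the distances to the virtual monitor points are minimized.
Here, $i$ indexes the camera rays and $k$ the mirror positions:


$$\big(\tilde{\mathbf{R}}_k, \tilde{\mathbf{t}}_k\big) = \arg\min_{\tilde{\mathbf{R}}_k, \tilde{\mathbf{t}}_k} \sum_i w_{ik}\, d(\tilde{\mathbf{x}}_{ik}, \mathbf{l}_i)^2 = \arg\min_{\tilde{\mathbf{R}}_k, \tilde{\mathbf{t}}_k} \sum_i w_{ik} \left\| \big(\tilde{\mathbf{R}}_k \mathbf{x}_{ik} + \tilde{\mathbf{t}}_k\big) \times \mathbf{d}_i - \mathbf{m}_i \right\|^2, \quad \text{subject to } \tilde{\mathbf{R}}_k \in O(3) \setminus SO(3). \tag{5.70}$$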
Due to the reflection at the mirror plane, the virtual rotation matrices are
now orthogonal matrices with $\det(\tilde{\mathbf{R}}_k) = -1$. Fortunately, both connected
components of $O(3)$, i.e., $SO(3)$ and $O(3) \setminus SO(3)$, are locally parametrized
by the same Lie algebra $\mathfrak{so}(3)$. Therefore, the previously described
optimization method can be used and only the initialization has to
be adapted. For this purpose, $\tilde{\mathbf{R}}_k \in O(3) \setminus SO(3)$ must hold. Since no other
information shall be specified, it shall be assumed that the reference mirror
is approximately orthogonal to the camera view axis, the LCD surface
of the monitor points approximately in the same direction as the camera,
and the y-axes of both coordinate systems are approximately collinear. In
other words, the monitor coordinate system is rotated by approximately
180° around the y-axis, see figure 5.8. A simple initialization for the virtual
rotation matrix is now obtained by defining the virtual pose as a reflection
on the y,z-plane, $\tilde{\mathbf{R}}_k = \mathbf{I} - 2\mathbf{n}_x \mathbf{n}_x^\top$ with $\mathbf{n}_x = (1, 0, 0)^\top$. Starting from this,
the generic pose estimation can converge sufficiently fast to a solution.
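The closed-form steps (5.65) to (5.69) for exactly three mirror positions can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used in this work: the function names are hypothetical, the intersection lines are taken from the null space of $\tilde{\mathbf{R}}_i \tilde{\mathbf{R}}_j^\top - \mathbf{I}$ via a singular value decomposition, and the three mirror normals are assumed to be linearly independent. Since the sign of each recovered normal is arbitrary, the sign of the corresponding distance is arbitrary as well.

```python
import numpy as np

def householder(n):
    """Reflection operator H = I - 2 n n^T for a unit normal n."""
    return np.eye(3) - 2.0 * np.outer(n, n)

def intersection_line(R_i, R_j):
    """Unit vector m with R_i R_j^T m = m, cf. (5.65)."""
    _, _, vt = np.linalg.svd(R_i @ R_j.T - np.eye(3))
    return vt[-1]                       # right singular vector of the null space

def linear_mirror_pose(R_virt, t_virt):
    """True monitor pose from three virtual poses, cf. (5.66), (5.68), (5.69)."""
    m12 = intersection_line(R_virt[0], R_virt[1])
    m13 = intersection_line(R_virt[0], R_virt[2])
    m23 = intersection_line(R_virt[1], R_virt[2])
    normals = [np.cross(m12, m13), np.cross(m12, m23), np.cross(m13, m23)]
    normals = [n / np.linalg.norm(n) for n in normals]

    # Rotation averaging: project the sum of the estimates onto a rotation.
    R_sum = sum(householder(n) @ R_v for n, R_v in zip(normals, R_virt))
    U, _, Vt = np.linalg.svd(R_sum)
    R = U @ Vt

    # Linear system (5.69) for the translation t and the distances d_1..d_3.
    A = np.zeros((9, 6))
    for i, n in enumerate(normals):
        A[3 * i:3 * i + 3, :3] = householder(n)
        A[3 * i:3 * i + 3, 3 + i] = 2.0 * n
    b = np.concatenate(t_virt)
    sol = np.linalg.lstsq(A, b, rcond=None)[0]
    return R, sol[:3], sol[3:], normals
```

With noisy virtual poses, the result is only an initial estimate that is expected to be refined by a subsequent nonlinear optimization.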


5.4.2 Nonlinear Optimization 


The linear solution is usually sensitive to noise, so it is only used as an
initialization for a subsequent optimization to refine the monitor pose
$\mathbf{R}, \mathbf{t}$ and the positions of the mirror $\mathbf{n}_k, d_k$ simultaneously [229]. To take
advantage of the generic camera model, it is advisable to minimize the
distance between the observed monitor points and the reflected rays
to obtain the optimal transformation between the camera and monitor
coordinate system and, in addition, to obtain the mirror pose. To
simplify the optimization, the mirror pose parameters are combined into
one vector [227]:


$$\mathbf{v}_k = d_k \mathbf{n}_k, \qquad \mathbf{H}_k = \mathbf{I} - 2\, \frac{\mathbf{v}_k \mathbf{v}_k^\top}{\|\mathbf{v}_k\|^2}. \tag{5.71}$$

Since in the deflectometric test setup the monitor is often suspended 
above the test sample, it shows a non-negligible curvature due to gravity. 
Therefore, it makes sense to estimate the monitor parameters c at the 
same time. The minimization of the following distance measure provides 
the desired parameters 


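$$f(\mathbf{R}, \mathbf{t}, \mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{c}) = \sum_{i,k} w_{ik} \left\| \big(\mathbf{H}_k (\mathbf{R}\, \mathbf{x}_{ik}(\mathbf{c}) + \mathbf{t}) + 2\mathbf{v}_k\big) \times \mathbf{d}_i - \mathbf{m}_i \right\|^2 . \tag{5.72}$$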
Since this objective function is highly nonlinear, but also contains only a 
few parameters, it can be minimized using standard optimization meth- 
ods, e.g., BFGS or Levenberg-Marquardt [140]. 
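To make the structure of the objective explicit, the following sketch evaluates the reflected point-to-ray distances for given parameters. It assumes the sign convention of (5.62) and (5.64), i.e., a reflected point $\mathbf{H}_k(\mathbf{R}\mathbf{x} + \mathbf{t}) + 2\mathbf{v}_k$ with $\mathbf{v}_k = d_k \mathbf{n}_k$; the function name `mirror_objective` and the array layout are hypothetical, and the monitor points are expected to already contain the z-coordinate of the shape model.

```python
import numpy as np

def mirror_objective(R, t, v, x, d, m, w):
    """Sum of weighted squared distances between reflected points and rays.

    R: (3, 3) monitor rotation, t: (3,) monitor translation,
    v: (K, 3) mirror vectors v_k = d_k n_k,
    x: (K, N, 3) monitor points observed by ray i at mirror position k,
    d, m: (N, 3) Plücker ray parameters, w: (K, N) weights.
    """
    total = 0.0
    for k in range(len(v)):
        H = np.eye(3) - 2.0 * np.outer(v[k], v[k]) / (v[k] @ v[k])
        p = (x[k] @ R.T + t) @ H.T + 2.0 * v[k]   # reflected monitor points
        r = np.cross(p, d) - m                    # point-to-ray residuals
        total += np.sum(w[k] * np.sum(r**2, axis=1))
    return total
```

Such a function could be handed to a general-purpose optimizer after a suitable parametrization of the rotation, e.g., via its axis-angle representation.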

The calibration result could be further improved by performing a 
holistic calibration and by including the calibration of the camera rays 
in the optimization [6]. However, there is the problem that the reference
mirror must be very accurate and very planar. Sufficiently
large mirrors of this quality are either not available or extremely expensive (e.g., a highly
planar mirror with a diameter of only 10 cm already costs about 800 €). If a
standard model-based camera calibration is used, this is not a big concern, 
because not every ray necessarily has to observe a reference feature. For 
generic methods, however, it must be ensured that each ray can observe 
enough points in the monitor plane. When only a small mirror is available, 
the calibration procedure is time-consuming as the mirror has to be 
placed in several positions. Therefore, a holistic optimization will not be 
considered further in this work and is only given as a brief outlook. 


5.5 Evaluation


The following sections examine the steps necessary for system calibration
and analyze the presented procedures. For the evaluation, a 27”
monitor with a resolution of 2560 × 1440 px and a pixel pitch of 233 µm
was used to display the necessary calibration patterns. Two different
imaging systems were used to evaluate the proposed generic camera 
calibration: A standard webcam (Logitech C920 HD Pro Webcam) and 
a microlens array-based light field camera (Lytro Illum). The first rep- 
resents a central camera that can be modeled with the classical pin- 
hole camera approach, whereas the second camera ultimately results 
in a non-central camera with multiple projection centers, which in ad- 
dition requires a much more complex camera model to be efficiently 
calibrated. The monitor was captured from 20 different poses, with
several phase-shift patterns recorded at each pose to encode
the target features. The phase shifting was performed identically in both
horizontal and vertical direction with $M = 12$ shifts per sequence and
with the frequencies $f = (1, 4, 16, 64)$, corresponding to wavelengths of
$\lambda = (2560, 640, 160, 40)$ pixels. The distances between monitor and camera
were in the range of 5 cm to 2 m. To compare the proposed technique
to the classic methods, the webcam was calibrated using the pinhole
model of Sec. 5.1 and Zhang’s algorithm [246], which is implemented in
the OpenCV library [26]. The light field camera was calibrated using the
state-of-the-art method by Bok et al. [20]. Both methods use static checker 
patterns that were displayed on the reference monitor. In addition, the 
calibration is also performed with the state-of-the-art generic calibration 
method from Bergamasco et al. [15]. They calibrate the camera by itera- 
tively calculating the intersection of the rays with the monitor plane, and 
by minimizing the resulting coding error to the observed target features 
on a pixel level. In addition, they estimate the reference pose using an 
adapted iterative closest point method, where they calculate the perpen- 
dicular projection of the 3D reference features onto the corresponding 
rays, and then align the set of 3D features with the set of perpendicular 
projections in an iterative manner. 

Because the webcam has a smooth mapping from pixels to camera rays, 
the spatio-temporal phase unwrapping from Ch. 4 can be used without 
any restrictions, which allows mapping reference features to camera pixels.
However, when the light field camera is used, strong discontinuities
appear near the edges of the microlenses. Consequently, these edges
need to be detected using the edge detection presented in Sec. 4.4.4, and
for these edge pixels, only the temporal unwrapping should
be used. Figure 5.9 shows the acquisition of reference features for the
Lytro Illum camera using phase-shift coding with probabilistic phase
unwrapping. It can be seen that the phase measurement shows strong
discontinuities near the boundaries of the microlenses. Also, in these
areas, the uncertainty increases due to vignetting that is caused by the
main lens and by the microlenses. This effect increases even more the
closer the pixels are to the edge of the sensor. In addition, the Bayer
pattern of the camera sensor affects the uncertainty in such a way that
it increases for the red and blue pixels (because in this specific dataset,
the spectrum of the displayed pattern seems to be centered around the
central wavelength of the green pixel).

Figure 5.9 Reference feature acquisition for the Lytro Illum camera: (a) shows the encoding
of the monitor’s x-coordinate. (b) shows the coordinate uncertainty. (c) & (d) show details
of the x-coordinate. (e) & (f) show detailed views of the coordinate uncertainty. (c) & (e)
show the center region and (d) & (f) the bottom right region of the camera sensor. For better
visualization, the color maps are stretched to maximize the contrast.


5.5.1 Error Metrics


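To ensure a fair comparison between the calibration methods, the different
models are examined with regard to the point-to-ray distance of
each ray to every observed feature in every monitor plane


$$\varepsilon_{ik} = \|\mathbf{p}_{ik} \times \mathbf{d}_i - \mathbf{m}_i\| . \tag{5.73}$$


For the error metrics, the mean distance and the root-mean-squared error
(RMSE) are calculated:


$$\operatorname{Mean}(\varepsilon) = \frac{\sum_{i,k} w_{ik}\, \varepsilon_{ik}}{\sum_{i,k} w_{ik}}, \tag{5.74}$$

$$\operatorname{RMSE}(\varepsilon) = \sqrt{\frac{\sum_{i,k} w_{ik}\, \varepsilon_{ik}^2}{\sum_{i,k} w_{ik}}}. \tag{5.75}$$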
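The error metrics (5.73) to (5.75) translate directly into a few lines of NumPy. The function names below are hypothetical and the arrays are assumed to be flattened over all ray/pose pairs.

```python
import numpy as np

def point_to_ray_distance(p, d, m):
    """Distance between points p and rays (d, m) with unit directions d, cf. (5.73)."""
    return np.linalg.norm(np.cross(p, d) - m, axis=-1)

def weighted_mean(eps, w):
    """Weighted mean distance, cf. (5.74)."""
    return np.sum(w * eps) / np.sum(w)

def weighted_rmse(eps, w):
    """Weighted root-mean-squared error, cf. (5.75)."""
    return np.sqrt(np.sum(w * eps**2) / np.sum(w))
```

Setting all weights to one yields the unweighted Euclidean variants of both metrics.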


The comparison is done here using a weighted distance (with $w_{ik} = \sigma_{ik}^{-2}$)
that allows assessing the quality of the camera calibration without being
too dependent on the quality of the used reference target features. To
demonstrate the benefit of using additional uncertainty information,
the Euclidean distances are evaluated as well by defining $w_{ik} = 1$. In the
following, weighted distances are symbolized by the variable $\varepsilon_w$, and
Euclidean distances by the variable $\varepsilon_e$. A comparison of the commonly
used projection error on a pixel level is not possible, because in a generic
camera model there is nothing like an “image plane”; there is just a set
of rays.


5.5.2 Initialization of the Alternating Minimization 


In principle, the presented generic calibration procedure can be initialized 
using the model-based approaches. For the webcam, the standard camera 
calibration and pose estimation provided by the OpenCV framework can 
be used. And for the light field camera, the calibration by Bok et al. with 
a succeeding standard pose estimation may help in the initialization. 
As a more generic alternative, one could also initialize using the generic 
relative pose estimation algorithm proposed by Ramalingam et al. [165]. 
The disadvantage of the method is, however, that the underlying camera 
model must be known. Different algorithms are needed for the central
model of the webcam and the non-central model of the light field cam-
era. In addition, the use of a two-dimensional planar calibration target 
(instead of a 3D target) adds ambiguities, which can be resolved only if 
there is a rough knowledge of the poses. Simulations and experiments
showed that their method works in principle, but that the procedure
is highly susceptible to noise. Moreover, there are very severe complications
for only slightly non-central cameras, like the MLA-based light
field cameras used in this work. As also described by Ramalingam et al. [165],
the procedure becomes extremely unstable, and no reliable pose can
be estimated, even if only very little noise is present. For light field
cameras, the method is therefore practically unusable and will thus not be
considered further in this work.

Nevertheless, because using another calibration procedure increases
the overall effort, it would be best to rely only on the generic
calibration method presented here. In this context, it could be observed
that in many cases it was even acceptable to just “guess” the initial
positions of the monitor. For example, although the monitor poses in
figure 5.10(a) are randomly initialized, the optimization converges
towards the optimal solution. However, even though the alternating
minimization is strictly convergent, a random initialization can, for
some starting configurations, cause the optimization to get stuck in a
suboptimal solution. Figure 5.10(d) depicts this situation, where some monitor poses
are estimated to lie behind the camera. To further minimize the error, the 
algorithm causes all monitor poses to lie flat on top of each other, and 
to eventually have the same rotation. The estimated ray bundle is then 
slit-shaped and completely flat, which is not the correct solution.
Investigations showed that such problems can be avoided by initializing
the translation vectors of the monitor poses in such a way that the order
of distances between camera and monitor poses is approximately correct.
Hence, it is useful to specify the distance to the camera during the data
acquisition for a subset of monitor poses, so that the distance is
approximately known. For example, the first three monitor poses could be
placed about 10 cm apart. Using only this subset of monitor poses, the ray
parameters can be estimated with sufficient accuracy in only 20-30
iterations. Finally, this rough estimate of the rays can be used to
initialize the camera calibration for the complete set of poses, where the
remaining poses can of course be positioned arbitrarily.

Figure 5.10 Initialization and final result: The figures show the observed monitor area and
the calibrated camera rays at start and end. Note the difference in scale. (a) & (b) Even with
an initially very bad pose estimation, the procedure converges towards reasonable results.
(c) & (d) A badly chosen initialization may converge to a suboptimal local minimum.
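
The distance-ordered initialization described above can be sketched as follows. This is only a minimal illustration and not the implementation used in this work: the function name `init_monitor_poses`, the identity rotation, and the assumption that the monitor is displaced along the z-axis of the camera are hypothetical choices.

```python
import numpy as np

def init_monitor_poses(first_distance_m, spacing_m, n_poses):
    """Return rough initial poses (R, t) for a subset of monitor poses.

    Assumption (not from the source): the monitor roughly faces the camera,
    so the rotation starts as the identity and the translation points along
    the camera z-axis. Only the ORDER of the distances needs to be right.
    """
    poses = []
    for i in range(n_poses):
        R = np.eye(3)
        t = np.array([0.0, 0.0, first_distance_m + i * spacing_m])
        poses.append((R, t))
    return poses

# Three poses placed about 10 cm apart, starting at an assumed 30 cm.
poses = init_monitor_poses(first_distance_m=0.3, spacing_m=0.10, n_poses=3)
distances = [float(np.linalg.norm(t)) for _, t in poses]
```

The remaining poses would then be added once the rays estimated from this subset are available.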


5.5.3 Convergence of the Alternating Minimization 


Figure 5.11 shows the convergence of the proposed method as the weighted
RMSE of the calibration error over the number of iterations. Here, the
calibration was carried out with and without acceleration and with and
without modeling the reference monitor. To investigate the robustness
against a bad initialization, the convergence behavior was investigated
for 50 trials, in which random translations in the range ±10 cm per
direction and random rotations of ±10° per axis were added to the
starting pose. For comparison, the convergence behavior of the generic
calibration method of Bergamasco et al. is also investigated, using the
same initializations. Although they minimize a different metric in their
optimization, the point-to-ray distance is evaluated here after each
iteration so that a fair comparison can be made.

[Plot: RMSE in µm over the iteration (0 to 1000). Curves: Bergamasco et al.,
Generic, Generic + Acceleration, Generic + Monitor, Generic + Monitor + Acceleration]

Figure 5.11 Convergence of AM-calibration depending on the initialization: The plot
shows the mean value and the ±σ range of the convergence of the objective function for
various initializations.
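
The perturbation of the starting pose used for such a robustness test can be sketched as follows. The sketch assumes uniformly distributed offsets, a fixed x-y-z composition of the axis rotations, and an arbitrary seed and starting pose; none of these details are specified in the text, and the names `rotation_xyz` and `perturb_pose` are hypothetical.

```python
import numpy as np

def rotation_xyz(angles_rad):
    """Rotation matrix composed of rotations about the x-, y-, and z-axis."""
    ax, ay, az = angles_rad
    cx, sx = np.cos(ax), np.sin(ax)
    cy, sy = np.cos(ay), np.sin(ay)
    cz, sz = np.cos(az), np.sin(az)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def perturb_pose(R, t, rng, max_shift_m=0.10, max_angle_deg=10.0):
    """Add a random translation per direction and a random rotation per axis."""
    shift = rng.uniform(-max_shift_m, max_shift_m, size=3)
    angles = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg, size=3))
    return rotation_xyz(angles) @ R, t + shift

rng = np.random.default_rng(0)                    # seed chosen arbitrarily
R0, t0 = np.eye(3), np.array([0.0, 0.0, 0.5])     # assumed starting pose
trials = [perturb_pose(R0, t0, rng) for _ in range(50)]
```

Each entry of `trials` would serve as the starting pose of one calibration run.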

The plot shows the average and the standard deviation of the RMSE
over all trials, visualized by the thick line and the light background color.
Figure 5.11 shows that the proposed method converges significantly faster
than the method of Bergamasco et al. and that it is less sensitive to a
bad initialization, as indicated by the smaller standard deviation of the
error. For some initializations, the method of Bergamasco et al. leads to
suboptimal solutions. The presented methods show slightly different
convergence behaviors during the minimization, yet every trial converges
very close to the same solution, as is visible from the very low standard
deviation in the last iterations. In addition, the improved convergence
rate when using the Nesterov acceleration is clearly visible. As a result,
the minimization converges to a sufficiently accurate result after about
300 iterations. Finally, it is evident that the monitor model pushes the
total calibration error even further down. Interestingly, when estimating
the monitor model, the convergence rate is slightly worse than when it is
not estimated. This can be explained by considering that the alternating
minimization now consists of three subproblems, and with an increasing
number of subproblems the convergence rate decreases.
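
The idea of accelerating an alternating minimization can be illustrated on a toy problem. The following sketch does not reproduce the modified Nesterov scheme of this work. It assumes a generic Nesterov-style extrapolation with a function-value restart, applied to a hypothetical, strongly coupled quadratic in two variables; all names are illustrative.

```python
import numpy as np

RHO = 0.99  # strong coupling makes plain alternating minimization slow

def f(z):
    """Toy objective (hypothetical): convex quadratic in x and y."""
    x, y = z
    return x * x + y * y + 2.0 * RHO * x * y - 2.0 * x

def am_step(z):
    """One sweep: exact minimization over x, then over y."""
    _, y = z
    x = 1.0 - RHO * y      # argmin over x for fixed y
    y = -RHO * x           # argmin over y for fixed x
    return np.array([x, y])

def plain_am(z0, iters):
    z = np.asarray(z0, dtype=float)
    for _ in range(iters):
        z = am_step(z)
    return z

def accelerated_am(z0, iters):
    z = z_prev = np.asarray(z0, dtype=float)
    t = 1.0
    for _ in range(iters):
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        v = z + ((t - 1.0) / t_next) * (z - z_prev)   # extrapolated point
        cand = am_step(v)
        if f(cand) > f(z):          # momentum overshot: restart from z
            cand, t_next = am_step(z), 1.0
        z_prev, z, t = z, cand, t_next
    return z

z_star = np.array([1.0, -RHO]) / (1.0 - RHO**2)   # analytic minimizer
z_plain = plain_am([0.0, 0.0], 2000)
z_acc = accelerated_am([0.0, 0.0], 2000)
```

The restart keeps the objective from increasing, because a rejected extrapolation falls back to an ordinary sweep from the current iterate.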

Since the rays are independent of one another, they can be processed
in parallel on a GPU. The optimization of 40 million pixels
(Lytro Illum) and 20 reference poses then only takes a few seconds per
iteration (Intel Core i7-6700, Nvidia GTX 1080 Ti, 16 GB RAM). Overall,
the calibration of the light field camera thus converges after about
45 minutes. The presented generic method is even faster and converges
after only a few minutes when calibrating the two-megapixel webcam.
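
Because every ray can be fitted on its own, the ray estimation lends itself to batched processing. The sketch below uses vectorized NumPy on the CPU as a stand-in for a GPU implementation and assumes an unweighted total least-squares line fit (centroid plus principal direction). The weighted analytical solution of this work is not reproduced here, and `fit_rays` is a hypothetical name.

```python
import numpy as np

def fit_rays(points):
    """Fit one 3D line per pixel to its observed reference points.

    points: array of shape (n_rays, n_poses, 3).
    Returns (origins, directions), each of shape (n_rays, 3).
    """
    origins = points.mean(axis=1)                        # centroid per ray
    centered = points - origins[:, None, :]
    scatter = np.einsum("nki,nkj->nij", centered, centered)
    _, eigvecs = np.linalg.eigh(scatter)                 # batched, ascending order
    directions = eigvecs[:, :, -1]                       # largest eigenvalue
    return origins, directions

# Two synthetic rays observed at four depths (noise-free).
true_dirs = np.array([[0.0, 0.0, 1.0], [0.6, 0.0, 0.8]])
true_orig = np.array([[0.0, 0.0, 0.0], [0.1, 0.2, 0.0]])
depths = np.array([0.3, 0.4, 0.5, 0.6])
pts = true_orig[:, None, :] + depths[None, :, None] * true_dirs[:, None, :]
origins, directions = fit_rays(pts)
```

The sign of each fitted direction is arbitrary, so comparisons with a reference direction should use the absolute value of the dot product.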


5.5.4 Required Number of Poses 


Theoretically, a ray can be estimated with two different point observations 
(then A qq is positive definite). And to fit a pose, three non-parallel rays are 
needed (then A,, is positive definite) that observe different points (then 
A... is positive definite). With only two reference targets, the optimization 
always converges to a perfect fit, which of course is useless. An
unambiguous and correct solution, however, can theoretically be obtained
with at least three reference poses [165]. In practice, more reference
targets are necessary, because the presented calibration is based on a
least-squares minimization approach and the impact of noise should be
reduced. This becomes apparent in figure 5.12, which shows the calibration
error when different numbers of reference targets are used. For this
purpose, the camera was calibrated 100 times, where each time a fixed
number of target patterns was randomly selected from a total set of 60
poses. The mean error of all calibrations and their ±σ standard deviation
are plotted over the number of used patterns. It can be seen that at least
15-20 poses are needed for a good calibration, whereas more poses increase
the overall robustness of the method. Too few patterns, on the other hand,
result in a very unreliable calibration. One can see similar results for
the OpenCV calibration, although the dependency on the number of patterns
is not as strong as for the proposed method. In summary, the proposed
calibration needs more reference poses to correctly estimate the immense
number of parameters. However, even with fewer poses, the error of the
proposed calibration is several times smaller than that of the model-based
calibration.

Figure 5.12 Dependency on the number of patterns: The plot shows the mean value and
the ±σ range of the error.
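
The random-subset experiment can be organized as in the following sketch. The callable `calibrate` is a hypothetical stand-in for an actual calibration routine, and `dummy_calibrate` is a placeholder error model that is used only to make the example runnable. It does not reflect measured behavior.

```python
import numpy as np

def subset_study(calibrate, n_total, subset_sizes, n_repeats, seed=0):
    """Mean and standard deviation of the error for each subset size.

    calibrate: callable that takes an array of pose indices and returns
    the resulting calibration error (hypothetical interface).
    """
    rng = np.random.default_rng(seed)
    stats = {}
    for n in subset_sizes:
        errors = [calibrate(rng.choice(n_total, size=n, replace=False))
                  for _ in range(n_repeats)]
        stats[n] = (float(np.mean(errors)), float(np.std(errors)))
    return stats

def dummy_calibrate(pose_indices):
    # Placeholder for demonstration only: error proportional to 1 / n.
    return 1.0 / len(pose_indices)

stats = subset_study(dummy_calibrate, n_total=60,
                     subset_sizes=[5, 10, 20, 40], n_repeats=100)
```

Drawing the subsets without replacement ensures that no pose is used twice within one calibration run.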


5.5.5 Evaluation of the Calibration Error 


For a quantitative comparison of the different calibration methods, the 
calibration error will be compared in the following. To verify the positive 
influence of using the reference target uncertainty on the calibration, the 
method “Generic (E)” is investigated in addition. This method is the same 
as the presented method but does not use the uncertainty, and instead 
only minimizes the Euclidean point-to-ray distance by defining w;, = 1 
for all target features. For a faster calibration, the proposed methods 
use Nesterov acceleration. In addition, the webcam was calibrated with
OpenCV and the light field camera with the method of Bok et al., where
checkerboard features were used as reference. For methods with the
suffix “checker”, the error was evaluated only for those camera pixels
that see the detected checker features. The other methods were evaluated
for each camera pixel using the phase-shift features.

Table 5.1 Calibration of the Logitech webcam.

                           Weighted error in µm   Euclidean error in µm
Logitech Webcam            Mean      RMSE         Mean      RMSE
OpenCV (checker)           —         —            304.9     395.0
OpenCV                     366.1     415.9        369.7     424.6
Bergamasco et al.          91.4      109.6        92.0      110.0
Generic (E)                25.9      33.7         28.2      37.6
Generic (H)                23.7      33.0         27.9      41.9
Generic                    23.6      32.4         27.8      39.4
Generic + Monitor model    14.7      17.6         14.9      17.7
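
The two summary statistics reported for each error metric, the mean and the RMSE of the point-to-ray distances, can be computed as in the following sketch. The weighting shown here, which scales each distance by the square root of a weight, is an assumption made for illustration. The exact definition of the weighted error is given by the objective function of this work and is not reproduced here.

```python
import numpy as np

def point_to_ray_distance(points, origins, directions):
    """Euclidean distance between 3D points and rays (one ray per point).

    Rays are treated as infinite lines through `origins` along `directions`.
    """
    d = directions / np.linalg.norm(directions, axis=-1, keepdims=True)
    diff = points - origins
    along = np.sum(diff * d, axis=-1, keepdims=True) * d
    return np.linalg.norm(diff - along, axis=-1)

def mean_and_rmse(distances, weights=None):
    """Mean and RMSE; weights=None corresponds to all weights equal to 1."""
    w = np.ones_like(distances) if weights is None else np.asarray(weights)
    weighted = np.sqrt(w) * distances      # assumed form of the weighting
    return float(np.mean(weighted)), float(np.sqrt(np.mean(weighted**2)))

points = np.array([[1.0, 0.0, 1.0], [0.0, 2.0, 5.0]])
origins = np.zeros((2, 3))
directions = np.tile([0.0, 0.0, 2.0], (2, 1))   # deliberately not normalized
dist = point_to_ray_distance(points, origins, directions)
mean_err, rmse_err = mean_and_rmse(dist)
```

Since the RMSE squares the distances before averaging, it reacts more strongly to outliers than the mean.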


5.5.5.1 Webcam Calibration 


Table 5.1 summarizes the result of the webcam calibration for the dif- 
ferent algorithms. It can be seen that the presented generic methods 
produce the best results. Even for the webcam, with its relatively simple 
optics, the presented method delivers both a smaller mean error and a 
smaller RMSE for both error metrics, resulting in a more precise
geometric calibration with fewer outliers at the same time. With the
classic model from the OpenCV library, most outliers cannot be used
because they are too far away from the model description. The generic
model, in contrast, can effectively use each individual pixel as a source
of information. This becomes particularly visible for the OpenCV
calibration when only the error regarding the checkerboard features is
evaluated. Here, the error is smaller than when all phase-shift features
are evaluated for every pixel. This demonstrates that the classic
calibrations optimize the camera model for only a part of the pixels,
namely the ones that observe checker features. The remaining pixels are
interpolated through the camera model and thus have a larger calibration
error.

Figure 5.13 Calibration error of the webcam: The OpenCV calibration on the left shows
strong systematic errors due to the parametric modeling approach, while the generic model
on the right shows a more noise-like result.

Figure 5.13 shows the error per pixel for the standard calibration and the
presented generic one. It can be seen
that the OpenCV calibration shows significant systematic errors, as the 
error increases or decreases depending on the distance to the center of the 
sensor. This wave-like behavior of the error is caused by the insufficient 
modeling capability of parametric models. Even for a simple webcam, 
the parametric modeling approach does not lead to perfect results. On 
the other hand, the generic approach calibrates each pixel individually, 
and hence, almost no systematic errors appear. The resulting calibration 
error is overall much smaller and has an almost noise-like characteristic. 
The proposed methods also perform better than the generic approach 
by Bergamasco et al. Even if the uncertainties are not taken into account 
and only the Euclidean distance is minimized, the presented method still 
outperforms the method by Bergamasco et al. Moreover, it can be seen 
that additional information about the coordinate uncertainty further 
improves the calibration. Inaccurate points are weighted less strongly 
and therefore have a weaker effect on the result. Interestingly, because 
“Generic (E)” directly minimizes the Euclidean RMSE, the respective 
value is smaller than the same metric for the “Generic” method. How- 
ever, the corresponding mean value of the uncertainty-based method 
is smaller, since outliers have less influence on the calibration. When 
using a hierarchical phase unwrapping approach with “Generic (H)”, the
mean error slightly increases, although the used phase-shift coding with
M = 12 shifts already strongly reduces the noise. The corresponding
RMSE values increase slightly more, compared to when the probabilistic
unwrapping is used in “Generic”, meaning that outliers are caused by
errors in the hierarchical phase unwrapping. Finally, using the monitor
model and estimating its parameters reduces the overall calibration error
even more.

Figure 5.14 Histogram of point-to-ray distances concerning all phase-shift features
(Logitech Webcam). The generic model creates a much tighter distribution with fewer outliers
as compared to the classical calibrations (note the logarithmic scale). Outliers can be further
suppressed with uncertainty information and by modeling the reference monitor.

Figure 5.14 illustrates the results of the calibrations by showing the 
distribution of all point-to-ray distances. The OpenCV calibration shows a 
very widespread distribution that is not symmetric due to the systematic 
modeling errors. The error distributions of the generic approaches are 
tighter, shifted to lower values, and are close to a normal distribution, 
which is to be expected since the errors are calculated from the set of 
independently calibrated rays. 


5.5.5.2 Light Field Camera Calibration 


Similar conclusions can be drawn with the Lytro Illum light field camera.
Table 5.2 summarizes the results of the calibration for the different
algorithms. Due to the more complex optics and the more extensive
optimization associated with this camera, the differences here are much
greater and the superiority of the proposed generic calibration becomes
even clearer. Although the model by Bok et al. is very sophisticated, it is
adapted strongly to the few checkerboard features and only produces
good results there. If the same model is evaluated for all phase-shift
features for every pixel, this leads to huge RMSE values caused by many
outliers. In this case, one can see particularly well that a
low-dimensional model-based approach cannot ideally describe every pixel
of a camera with complex optics, such as the light field camera. Moreover,
the benefit of using uncertainties becomes very apparent: the quality
of pixels in microlens-based light field cameras (and the ability to model
the corresponding rays accurately) deteriorates towards the edges of the
microlenses, leading to increased uncertainties (see figure 5.9). These can,
however, be suppressed effectively by the proposed generic method,
leading to much smaller mean errors and RMSE values for both error metrics.

Table 5.2 Calibration of the Lytro Illum camera.

                           Weighted error in µm   Euclidean error in µm
Lytro Illum                Mean      RMSE         Mean      RMSE
Bok et al. (checker)       —         —            448.4     851.8
Bok et al.                 375.6     758.8        7165.1    8686.0
Bergamasco et al.          922.1     1696.5       1041.8    1720.5
Generic (E)                77.1      185.3        163.1     438.5
Generic (H)                60.4      153.4        165.2     510.7
Generic                    49.4      105.8        155.5     457.4
Generic + Monitor model    43.3      91.4         149.8     439.7

The method by Bok et al. can calibrate the center of each microlens very
well. Here, their calibration error reduces to about 60 µm for the best
pixels. This results in a relatively good reconstruction of the central
sub-aperture image, as will be analyzed in detail in Ch. 6. However, the
farther the pixels are from the microlens center, the larger the error
becomes. This reduces the overall calibration quality, as shown in the
results. Also, the method by Bok et al. returns a light field with only
35 million pixels, as compared to the total of 41 million pixels of the raw
data. The worst pixels, which are between neighboring microlenses, are not
used in the modeling and are therefore cut off. Thus, they cannot be
analyzed in the evaluation made here. In contrast, the proposed generic
model can effectively calibrate the rays of every pixel of the sensor,
whereby good calibration results are achieved not only in the centers of
the microlenses but also at the edges, where it is very difficult to
describe the light field camera with a uniform model. By using the
uncertainty of the target features, these pixels at the microlens edges
can be easily identified as outliers. Therefore, they are automatically
compensated and have less influence on the pose estimation, which further
improves the ray estimation. Interestingly, using the hierarchical phase
unwrapping to obtain the monitor coordinates with “Generic (H)” instead of
the probabilistic approach with “Generic” has a more significant effect on
the light field camera than it had on the webcam. The overall calibration
error is much larger, which is caused by the pixels at the microlens edges.
Here, due to the strong vignetting, the signal-to-noise ratio is reduced,
resulting in higher phase noise. This again demonstrates the advantages of
the proposed probabilistic phase unwrapping. Finally, using the monitor
model and estimating its parameters further reduces the overall
calibration error. Compared to the webcam calibration, however, the
improvement here is smaller.

Figure 5.15 Calibration error of the Lytro Illum. Left: At a global view, the error is
independent of the position on the sensor. Right: The error increases near the microlens edges.

Figure 5.15 shows the calibration error of the proposed generic method
for each camera pixel. Although the error increases near the microlens
edges, it is still very small. The reason that these pixels cannot be
described better by the generic camera model is that in reality there is a
superposition of vision rays. This means that the light cone belonging
to the ray is either strongly elliptically distorted or simply consists of
the superposition of multiple individual cones. A disadvantage of the
generic camera model is that it can only return the mean value of the
corresponding light cone for such pixels, which does not necessarily
reflect reality. Nonetheless, because a superposition of light cones causes
a high uncertainty in the phase-shift coding, the corresponding pixels
have an insignificant influence on the uncertainty-based calibration
presented here. Especially during the pose estimation, outliers are strongly
suppressed, and thus, the overall calibration result still turns out well.

Figure 5.16 Histogram of point-to-ray distances concerning all phase-shift features (Lytro
Illum). The generic model creates a much tighter distribution with fewer outliers as compared
to the classical calibration (note the logarithmic scale). Outliers can be suppressed even
more with uncertainty information and by modeling the reference monitor.

While the method by Bergamasco et al. delivers good results for the
webcam, it does not seem to work well with the light field camera.
Although the calibration of the webcam shows that their approach works, it
seems that it does not generalize as well as the proposed method and that
it has difficulties with the poor quality of the pixels at the edges of the
microlenses. The procedure diverged in the experiments. Only after
improving the initialization with a few iterations of the proposed method
and excluding the pixels with the highest uncertainty could a convergent
result for their method be obtained, which still has a smaller error
than the calibration by Bok et al.

Figure 5.17 Experimental deflectometry setup.

Figure 5.16 summarizes the results and shows the distribution of all
point-to-ray distances for the different calibrations of the Lytro Illum. The
method by Bok et al. results in a multi-modal distribution with the lowest
peak at about 60 µm. Also, several peaks systematically appear at higher
distances, which is due to the difficulties of modeling a light field camera.
These peaks correspond to the average calibration error of the individual
sub-aperture images, as will be discussed in detail in Ch. 6. The method
by Bergamasco et al. results in a distribution with many errors at high
values (more than 1 mm). The proposed methods, on the other hand, yield
much tighter distributions with peaks at far lower values. Moreover, larger
errors from minimizing only the Euclidean distance can be reduced to
smaller ones by using the generic calibration with uncertainty-based
weighting. In addition, using a monitor model further improves the result.


5.5.6 Mirror-Based Pose Estimation 


The experimental setup of the deflectometry system used in this work
is shown in figure 5.17. The reference monitor is the same as the one
used for the camera calibration, and the camera is the Lytro Illum, which
was calibrated using the generic calibration. To perform a deflectometric
measurement, the relative pose between camera and monitor must be
calculated using the procedure from Sec. 5.4. For this purpose, a
precision surface mirror with λ/20 flatness is used as a reference mirror,
i.e., with the reference wavelength of 632.8 nm, the mirror has a maximum
peak-to-valley deviation from the perfect plane of 31.64 nm. The
mirror is placed in 10 different positions and each time reference points
are recorded using phase-shift coding. Using the adapted generic pose
estimation (5.70), the mirrored pose of the virtual monitors is estimated.
Then, using the procedure shown in Sec. 5.4.1, a linear solution is found
for the true pose between the camera and monitor. Subsequently, the
nonlinear optimization (5.72) improves the final estimate. Figure 5.18 shows
the mirror planes and the resulting pose of camera and monitor.
Interestingly, the calibration error varies considerably during the
estimation steps. The estimation of the virtual poses results in an RMSE of
92.0 µm. However, after the linear solution for the true pose is found, it
increases substantially to 640.2 µm. An explanation for this is that when
two mirror positions are only slightly inclined to each other, the distance
of the intersection line of both planes to the measurement setup can grow
to very large values, which leads to numerical instabilities in (5.67) and
(5.69). Still, the subsequent nonlinear optimization can compensate for
this, so that the RMSE of the final pose estimation decreases again to
95.2 µm.

Figure 5.18 Result of mirrored pose estimation for the experimental setup from figure 5.17.
Top left: Reference monitor. Top right: Camera and estimated camera rays. Bottom:
Reference mirrors.

Figure 5.19 Visualization of the display surface (color scale: height in mm). Left: Because
of the weight on the corners, the monitor surface is twisted. Right: The monitor hangs above
the surface and due to gravity, the surface is slightly bent.
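
The geometric reason for this instability can be made tangible with a small numerical sketch. It computes the distance of the intersection line of two planes from the origin, which is assumed here to represent the location of the measurement setup. The plane offsets and tilt angles are arbitrary illustrative values, and the equations (5.67) and (5.69) themselves are not reproduced.

```python
import numpy as np

def intersection_line_distance(n1, d1, n2, d2):
    """Distance from the origin to the intersection line of two planes.

    Each plane is given as n . x = d. The minimum-norm solution of the
    underdetermined 2x3 system is the point of the line closest to the origin.
    """
    A = np.vstack([n1, n2]).astype(float)
    b = np.array([d1, d2], dtype=float)
    closest, *_ = np.linalg.lstsq(A, b, rcond=None)
    return float(np.linalg.norm(closest))

def tilted_normal(angle_deg):
    """Unit normal tilted by angle_deg away from the z-axis."""
    a = np.deg2rad(angle_deg)
    return np.array([np.sin(a), 0.0, np.cos(a)])

n_ref = np.array([0.0, 0.0, 1.0])
dist_10deg = intersection_line_distance(n_ref, 0.50, tilted_normal(10.0), 0.55)
dist_1deg = intersection_line_distance(n_ref, 0.50, tilted_normal(1.0), 0.55)
```

In this example, reducing the inclination between the planes from 10° to 1° moves the intersection line several times farther away, and it recedes without bound as the planes become parallel.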


5.5.7 Shape Estimation of the Reference Target 


It could already be shown that the monitor model improves the calibra- 
tion. However, it is not yet clear whether the model also provides realistic 
values. To verify this, the monitor was measured in two positions. In the 
first measurement, the monitor lies on the ground, and both the upper 
left corner and the lower right corner are loaded with weights. Hence, the 
monitor should show a torsion. A second measurement shows the moni- 
tor in the deflectometry setup, see figure 5.17. Here, the monitor hangs 
above the surface under test, and the screen points downwards. This 
causes the outer areas of the monitor to also bend downwards, which 
results in an increased curvature of the display surface. The monitor 
parameters can be obtained after calibration, and with them, the shape 
of the monitor can be calculated. Figure 5.19 shows the results for both 
measurements. The figure clearly shows that the monitor has a strong
torsion in the first measurement, while it is slightly curved in the second
one, as was to be expected. The distance between the highest and the lowest
point is 2.5 mm for the first measurement and 0.8 mm for the second.




This was also approximately verified by placing a straight metal bar 
on the surface and by measuring the distance between the bar and the 
screen surface with a ruler. Therefore, in conclusion, the calibration of 
the monitor shape is satisfactory. 
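
The distance between the highest and the lowest point of a display surface can be read from a sampled height map, as the following sketch shows. The display size and the two surface shapes (a bilinear twist and a parabolic bend) are purely hypothetical and do not correspond to the monitor model or the measured values of this work.

```python
import numpy as np

def peak_to_valley(height_map):
    """Distance between the highest and the lowest point of a surface."""
    return float(np.max(height_map) - np.min(height_map))

# Hypothetical display of 0.5 m x 0.3 m, sampled on a regular grid.
x, y = np.meshgrid(np.linspace(-0.25, 0.25, 101),
                   np.linspace(-0.15, 0.15, 61))
twisted = 0.04 * x * y        # torsion: opposite corners move the same way
bent = 0.016 * x**2           # bending: outer areas deviate from the center

pv_twisted = peak_to_valley(twisted)   # 2 * 0.04 * 0.25 * 0.15 m
pv_bent = peak_to_valley(bent)         # 0.016 * 0.25**2 m
```

A ruler measurement against a straight bar, as described above, checks the same quantity along a single line of the surface.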


5.6 Summary 


In this chapter, the calibration of the deflectometric measuring system 
was described. The main contribution was a new calibration technique for 
the generalized camera model. The proposed method splits the calibra- 
tion into two parts, a ray calibration and a pose estimation, and it applies 
an alternating minimization to efficiently optimize the immense number 
of parameters. Dense calibration features were obtained using phase-shift 
coding techniques, and the measurement uncertainty that was estimated 
during the pre-processing could be used in the optimization. A simple 
analytical solution to minimize the ray subproblem was presented. Fur- 
ther, the pose was optimized by decoupling rotation and translation, and 
by using gradient descent on the rotation manifold. Since calibration 
references, i.e., standard LCD screens, are generally not ideal, the shape 
and also the refraction at the cover glass were modeled, which allowed 
the estimation of the reference parameters to be efficiently integrated into 
the generic calibration. Because alternating minimization typically has 
a slow convergence rate, Nesterov’s acceleration scheme was modified 
to speed up the optimization process. Since the reference monitor is not
in the camera’s direct field of view in a deflectometric measurement
setup, a mirror-based pose estimation was adapted, which could also
be efficiently combined with the presented generic calibration procedure.

Finally, experimental evaluation verified the advantages of the pro- 
posed camera calibration method over conventional and other general- 
ized approaches. In this context, the benefit of using additional infor- 
mation about the uncertainty of the calibration target coordinates was 
demonstrated, and it could be shown that modeling the reference target 
leads to a considerable improvement in the calibration. 
The calibration methods from the last chapter can already describe all
of the optical components very precisely. The generic camera model 
achieves a high degree of accuracy, but in the process of the calibration, 
information is discarded, namely, the topological relations between the 
pixels. For many areas of optical metrology, this does not pose a prob- 
lem, as often only the geometric ray properties are relevant [145, 209, 
247]. In profilometry, for example, a projector illuminates a scene with 
a coded pattern sequence and each scene point can thus be assigned to 
a projection ray and a vision ray, allowing for a direct triangulation of 
the point’s depth. The same principle cannot be implemented in deflec- 
tometry without further work, since it is not the specular object that is 
optically encoded here, but the distorted mirror image of the reference 
pattern generator. Therefore, direct triangulation of the surface cannot 
be performed. Rather, the object is measured indirectly by triangulation 
of the normal field, as will be explained in Ch. 7. An important step in 
this triangulation is the forward and backward projection from camera 
rays to 3D points and vice versa. While it is very easy to calculate the 
3D points along the corresponding ray for each pixel, it is very difficult 
with the generic camera model to find out to which pixel a 3D point is 
projected. More specifically, it would be extremely time-consuming to 
calculate for each 3D point its closest camera ray (or rays), since a com- 
plete search over all rays would have to be performed for each point. For 
an indirect triangulation of the specular surface, the completely generic 
camera model is therefore unsuitable. Hence, further processing of the 
calibrated rays is required to recover the neighborhood information be- 
tween the pixels or, in the case of the light field camera, it is necessary to 
restore the 4D relation between the camera rays. 
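
The asymmetry between the two projection directions can be illustrated with a short sketch. It assumes a set of unit-length ray directions with a common origin, generated randomly for demonstration. The function names are hypothetical, and the exhaustive search is shown only to make its cost per point explicit.

```python
import numpy as np

def points_on_rays(origins, directions, depths):
    """From rays to 3D points: cheap, one evaluation per pixel."""
    return origins + depths[:, None] * directions

def closest_ray_index(point, origins, directions):
    """From a 3D point to a pixel: exhaustive search over all rays.

    directions must be unit vectors; the cost is O(n_rays) per 3D point.
    """
    diff = point - origins
    along = np.sum(diff * directions, axis=1, keepdims=True) * directions
    return int(np.argmin(np.linalg.norm(diff - along, axis=1)))

rng = np.random.default_rng(1)
directions = rng.normal(size=(1000, 3))
directions[:, 2] = np.abs(directions[:, 2])        # rays point away from the camera
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
origins = np.zeros((1000, 3))

target = points_on_rays(origins, directions, np.full(1000, 2.0))[7]
index = closest_ray_index(target, origins, directions)
```

For a sensor with millions of pixels and many 3D points, this search would have to be repeated for every point, which is why the neighborhood relations between the rays need to be recovered.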

Apart from the difficulties that arise in deflectometry, light field cam- 
eras also have many other applications, where the geometric calibration 
of the camera itself is not crucial, but rather the correct reconstruction of 
the light field and its SAIs. Examples of this are depth estimation, chang-
ing the perspective on the scene, digital refocusing and artificial bokeh, 
or hyperspectral image reconstruction, as discussed in Sec. 3.2. The em- 
phasis is therefore rather on the reconstruction of the image content than 
on the ray parameters. To use light field cameras for such applications, 
it is necessary to obtain the 4D information of the light field. And this 
information must first be decoded from the raw 2D sensor data of the 
respective light field cameras. Unfortunately, due to their complex struc- 
ture, their calibration is very difficult and usually precisely tailored to the 
particular type of light field camera. Hence, specially adapted algorithms 
have to be used and a great deal of effort must be invested in modeling 
the camera optics. However, as already described in the last chapter, 
low-dimensional models are often not sufficient to represent all prop- 
erties of an optical system—especially when it comes to sophisticated 
and highly specialized optical systems like light field cameras. In fact, 
the characteristics of the optics are already very precisely incorporated 
in the generic ray bundle. Therefore, it makes sense to directly utilize the 
calibrated rays for light field decoding as well. 

To overcome the issues of highly specialized decoding algorithms, and 
to use the already precisely estimated camera rays, this chapter presents 
an algorithm that uses the generic camera calibration as a basis for recon- 
structing a light field from the unconstrained set of rays. Hereby, a generic 
light field reconstruction is realized, which can be used to reconstruct 
light fields from arbitrary light field imaging systems, independent of 
whether the camera is based on microlenses, mirrors, or coded apertures, 
or whether it is realized by employing a camera array. 

In the following section, related works in the field of light field decod- 
ing and reconstruction are presented. Then, in Sec. 6.2, a new generic 
approach for light field reconstruction is proposed that only uses the 
information contained in the set of rays obtained via the generic cam- 
era calibration. Finally, Sec. 6.3 experimentally validates the proposed 
method by reconstructing real light fields obtained with different light 
field acquisition systems and compares it to state-of-the-art methods. 
6.1 Related Works


The first work on light field calibration was done in the context of
multi-camera arrays [208]. However, these methods cannot simply be
transferred to other light field acquisition systems such as MLA-based
light field cameras. Due to their complex design, the light field has to be
decoded from the raw sensor image using sophisticated algorithms.
Furthermore, each lens (main lens and microlens) is affected by lens
aberrations, i.e., a subsequent rectification of the decoded light field is
necessary to obtain correct geometric information relevant for image
processing and optical metrology.

Among the microlens-based light field cameras, the standard plenoptic
camera (or unfocused plenoptic camera) has been studied the most, as it is
useful in consumer applications and image processing without requiring
metric calibration [137]. To still be able to compensate for optical distor- 
tions, Ng and Hanrahan [136] suggested a digital correction of the lens 
aberrations without metric calibration by digitally re-sorting aberrated 
rays to where they should have terminated. The first metric calibration 
of a commercial light field camera was proposed by Dansereau et al. [44]. 
To decode the light field from the sensor data, they first estimate the 
grid parameters of the MLA. This is done by detecting the microlens 
centers from corresponding white images and building a regular grid 
that best approximates the detected centers. The light field is then de- 
coded by assigning a spatial coordinate to each microlens and an angular 
coordinate to every pixel under each microlens, and by converting the 
hexagonal grid of microlenses into a rectangular one. Subsequently, the 
decoded light field is calibrated using a camera model consisting of ten 
intrinsic parameters and five distortion parameters, allowing the SAIs 
to be corrected by inverting the distortions. In this process, the calibration
is initialized using the SAIs and then refined by minimizing the ray
re-projection error, i.e., the distance between the 3D positions of checkerboard
features and the camera rays. Cho et al. [38] perform an erosion
operation on the white image and estimate the microlens centers by using 
clustering and a parabolic fitting. They then decode the light field directly 
from the hexagonal layout using a barycentric interpolation. However, 
they neither perform metric calibration nor rectification. Bok et al. [20], 
in contrast, presented a method that can extract a rectified light field di- 
rectly from raw sensor data, avoiding intermediate reconstruction steps. 
In addition, they introduce a new projection model for microlens-based
light field cameras that contains a smaller number of parameters than 
the previous methods. Instead of checkerboard corner features, they use 
line features extracted directly from the raw data. Further, the microlens 
centers are calculated individually without fitting a grid and the light 
field is decoded by barycentric interpolation. Eventually, the light field is 
rectified and the camera parameters are calculated by minimizing the dis- 
tance between line segments and camera rays. Since all methods rely on 
a correct description of the microlens grid, Schambach and Puente León 
[180] propose an extended model that additionally takes into account 
the natural and mechanical vignetting of the microlenses and main lens. 
As a consequence, the calibration becomes more accurate, especially in 
SAIs corresponding to the peripheral regions of the angular dimension 
where the vignetting effect is more prominent. 

For a focused plenoptic camera, the distance between the MLA and 
the sensor is not equal to the microlens’ focal length. As a result, these 
cameras achieve a higher spatial resolution with decreasing angular res- 
olution. To further increase the depth of field, the manufacturer Raytrix 
proposed multi-focus cameras in which the microlenses have different 
focal lengths [148]. Unlike the unfocused plenoptic camera, where each 
pixel under the microlens can be assigned to an SAI, the (multi-)focused 
plenoptic camera works like a micro camera array, where each microlens 
can be interpreted as a virtual camera observing a very small section of 
the scene. By using neighboring microlenses to perform stereo-based 
triangulation, a virtual depth map can be estimated. And by stitching 
the micro-images together using this depth information, an all-in-focus 
image of the scene can be reconstructed. However, because the virtual 
depth map can only be interpreted in a relative manner, a metric cali- 
bration is necessary. A first approach for the calibration of a multi-focus 
plenoptic camera was suggested by Johannsen et al. [97]. They extract a 
depth map and an all-in-focus image from the camera data and model 
the resulting synthetic image using a 15-parameter model that includes 
lateral distortion as well as a depth-dependent distortion. Heinze et al. 
[80] extended the model by considering the different focal lengths of 
the microlenses. Zeller et al. [239] introduced a new depth distortion 
model that is directly derived from the theory of depth estimation in a 
focused plenoptic camera, and in addition, they extended the residual
of their optimization to three dimensions by including the virtual depth. 
A disadvantage of the above methods is that they depend on Raytrix’s 
software package since they do not start from raw data but the synthetic 
all-in-focus image and the virtual depth map. 


6.2 Generic Light Field Reconstruction 


To be able to extract light field information from the raw data, the pre- 
viously discussed methods must initially detect the centers of the mi- 
crolenses with high precision. But even with a subpixel accurate detec- 
tion, most of the time only the rays near the center of the microlenses 
are precisely calibrated. The camera rays at the boundary of the mi- 
crolenses are very difficult to model in all approaches, and therefore 
these pixels are often discarded. Another disadvantage of the classical 
methods is the model-based calibration in general. It cannot describe 
highly local errors such as the strong distortions at the boundaries of 
the microlenses using only a low-dimensional model. Hence, a generic 
camera calibration should be advantageous. However, the biggest disad- 
vantage of the common light field reconstruction methods is that they 
each are only applicable to a single type of camera. For example, the 
methods by Dansereau et al. [44] and Bok et al. [20] can only be used with 
MLA-based light field cameras whose microlenses are exactly focused 
onto the sensor. 

Since the calibrated rays describe the camera very well, it also makes
sense to make use of them for the light field reconstruction. In fact, the generic
ray bundle already represents the light field perfectly and optimally takes 
into account all distortions of the camera optics. More precisely, this 
means that the set of rays is effectively an irregularly sampled version 
of the distortion-free light field. For the light field reconstruction, this 
implies that no specific model of the used camera has to be developed, 
the sensor data does not have to be decoded according to this model, it is 
not necessary to detect the centers of any microlenses, and no hexagonal 
sampling of an MLA has to be compensated. Instead, the irregularly 
sampled light field has to be transformed into an adequate representation. 
Of course, since this is completely independent of the camera optics used, 
a fully generic reconstruction algorithm is obtained that can be applied
to any type of light field camera. 

In the following, it will be explained how a conventional light field can 
be reconstructed from the unconstrained ray bundle. For this purpose, 
a regular and discrete parametrization of the target light field is first 
found based on the irregular data. Subsequently, it is shown how the 
irregular data is interpolated in a suitable way to this newly defined 
regular grid of light field pixels. And finally, to be useful for optical 
metrology applications, the intrinsic parameters of the reconstructed 
light field are derived. 


6.2.1 Parametrization of Light Field Coordinates 


To decode a light field from the raw sensor data, the camera must first 
be calibrated, e.g., by using a generic calibration method as described 
in Sec. 5.2. As a consequence, all the preprocessing steps of the conven- 
tional state-of-the-art light field calibration algorithms are not needed 
at all. Even more, it does not actually matter what type of light field 
acquisition device is used. After applying the generic camera calibration, 
a ray bundle is obtained in an arbitrary coordinate system, which can 
easily be transformed into a camera-fixed coordinate system using the 
normalization presented in Sec. 5.2.7. Since most light field algorithms 
do not work with Plücker-coordinates, as the last step, the camera ray 
parameters are transformed into light field coordinates. To do so, the rays 
are first transformed into the camera-fixed coordinate system, by shifting 
the origin and rotating the axes. Afterward, the intersections of the rays 
with the two-plane representation of the light field are calculated. For
this, the u, v-plane is placed orthogonal to the z-axis at the origin of
the coordinate system, which corresponds approximately to the center
of the camera’s exit pupil when an MLA-based light field camera is used.
The s, t-plane is placed parallel to this at an arbitrary distance f, see
figure 6.1.

Figure 6.1 Two-plane parametrization of the light field. The ray $\mathbf{l}_i$ intersects the u, v- and
the s, t-plane in $(s_i, t_i, u_i, v_i)$. The intensities in the planes visualize the spatial distribution
of the intersection points as a 2D histogram. The u, v-plane lies in the plane of the camera’s
main lens. The s, t-plane corresponds to a projection on the rectangular sensor.

Thus, each camera ray $\mathbf{l}_i = (\mathbf{d}_i^\top, \mathbf{m}_i^\top)^\top$ can be described by four
light field coordinates $\tilde{s}_i, \tilde{t}_i, \tilde{u}_i, \tilde{v}_i$:

$$\lambda \, (\tilde{s}_i, \tilde{t}_i, \tilde{u}_i, \tilde{v}_i, 1)^\top = \mathbf{P} \, \mathbf{T} \, \mathbf{l}_i, \tag{6.1}$$

with $\lambda \neq 0$, using the coordinate transformation matrix $\mathbf{T}$ that is derived
in Sec. 5.2.7, and with a ray-to-light-field projection operator $\mathbf{P}$ [94]:

$$\mathbf{P} = \begin{pmatrix}
f & 0 & 0 & 0 & -1 & 0 \\
0 & f & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & -1 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0
\end{pmatrix}. \tag{6.2}$$

6.2.1.1 Regular Light Field Grid 


To reconstruct a light field from the bundle of rays associated with the 
camera, the calibrated ray coordinates must first be transformed into a 
standardized grid. Afterwards, the observed ray intensities $L(\tilde{s}_i, \tilde{t}_i, \tilde{u}_i, \tilde{v}_i)$
can be interpolated to a discretized light field, which is parametrized in 
the same two-plane representation as previously. The complete set of 
real camera rays, which is described as a set of 4D points, is arranged 
in an irregular 4D grid. Still, the classical light field algorithms (e.g., 
refocusing and depth estimation) require a regular grid with uniform 
spacing. Therefore, this irregular grid of continuous ray coordinates has 
to be interpolated to a discrete light field described by a regular grid. 
Hence, it is necessary to define a regular grid with integer grid points

$$(s, t, u, v) \in [0, N_s - 1] \times [0, N_t - 1] \times [0, N_u - 1] \times [0, N_v - 1] \tag{6.3}$$

with a fixed number of samples $N_s, N_t, N_u, N_v$ in the respective dimensions.
After the discrete target light field has been defined, the set of real
camera rays is transformed, for which the parameter space of the actual
ray geometry must be estimated. For this, the domains of the real light
field dimensions have to be determined by analyzing the intersection 
points of the rays with both planes of the light field representation. It goes 
without saying that among all rays there are also isolated outliers that 
deviate so strongly from the others that it is not worthwhile to consider 
them in the interpolation. Therefore, the 2D densities of the intersection 
points should be investigated by making use of a 2D histogram analysis, 
see figure 6.1. To place the regular grid structure into the 2D density 
of the irregular data, a threshold value on the histogram data enables 
defining the grid extension. A threshold of, e.g., 1% ensures that most of 
the camera’s rays are within the range defined by the grid. 

Since the real light field parameters are specified in physical units, e.g.,
millimeters, they have to be transformed to the previously defined discrete
4D pixel grid by subtracting the minimal values $s_0, t_0, u_0, v_0$, normalizing
by the widths of the histogram $\Delta s, \Delta t, \Delta u, \Delta v$, and considering the number
of samples. The normalized coordinates are then defined by

$$\begin{aligned}
s_i &= \frac{\tilde{s}_i - s_0}{\Delta s} (N_s - 1), & t_i &= \frac{\tilde{t}_i - t_0}{\Delta t} (N_t - 1), \\
u_i &= \frac{\tilde{u}_i - u_0}{\Delta u} (N_u - 1), & v_i &= \frac{\tilde{v}_i - v_0}{\Delta v} (N_v - 1).
\end{aligned} \tag{6.4}$$
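As an illustration of the normalization (6.4), the following sketch estimates the extent of one coordinate and maps it to the pixel grid. The symmetric quantile clipping is an assumption that stands in for the histogram thresholding described above; the function names and the `coverage` parameter are hypothetical.

```python
import numpy as np

def grid_extent(values, coverage=0.99):
    """Return (minimum, width) of an interval holding `coverage` of the values.

    Assumption: symmetric quantile clipping replaces the 2D histogram
    threshold of the text; it only serves to reject isolated outliers.
    """
    tail = 0.5 * (1.0 - coverage)
    lo, hi = np.quantile(values, [tail, 1.0 - tail])
    return lo, hi - lo

def normalize(values, minimum, width, n_samples):
    """Apply (6.4): map physical coordinates to the range [0, n_samples - 1]."""
    return (values - minimum) / width * (n_samples - 1)
```

Rays whose normalized coordinates fall outside [0, N - 1] after this step are the outliers that the grid does not cover.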


This still results in irregularly spaced data, which however can now 
be interpolated more easily to obtain the desired regularly sampled 
light field. The number of 4D cubes in each direction and the length of 
their edges could in principle be defined arbitrarily, but it is advisable to 
incorporate knowledge about the physical camera. For example, the Lytro
Illum camera considered in this work has microlenses with a diameter
of about 15 pixels. Thus, because the camera is of the unfocused design,
this sampling can be used directly as a basis for the discretization of the
angular coordinates of the u, v-plane, where $N_u = N_v \approx 15$ due to the
circular shape of the camera’s main lens. The sampling of the s, t-plane
can be determined in the same way, e.g., by the number of microlenses
in front of the sensor, whereby it is advisable to choose
$\frac{\Delta s}{N_s - 1} = \frac{\Delta t}{N_t - 1}$ to obtain square-shaped spatial pixels.

Figure 6.2 Different sampling patterns of the u, v-plane: The dots represent the pixel
coordinate, and the lines limit the pixel area. (a) Cartesian sampling. (b) Polar sampling
with equidistant radius spacing, using $R_n \propto 1 + n$. (c) Polar sampling with equal pixel
area, using $R_n \propto \sqrt{n N_\phi + 1}$.


6.2.1.2 Polar Parametrization of Angular Coordinates 


As can be seen in figure 6.2, the parametrization of the u, v-plane us- 
ing Cartesian coordinates is not always ideal. If the grid is defined to 
enclose the entire circle, then the light field is reconstructed in areas 
where no rays pass through the u, v-plane. If the grid is placed inside 
the circle, a sufficient number of rays will pass through each light field 
pixel. However, information is discarded at the edges. Hence, it would be 
better to directly use a polar parametrization of the angular coordinates, 
which would allow the entire information to be captured without sam- 
pling unneeded areas. Therefore, the angular coordinates are defined 
by polar coordinates $r$ and $\phi$. To further obtain a resolution comparable
to the Cartesian sampling, the number of samples is chosen to be
$N_r = N_u$ and $N_\phi \approx N_v$. The coordinates are then linearly sampled in
the domain $r \in \left[-\frac{N_r - 1}{2}, \frac{N_r - 1}{2}\right]$ and $\phi \in [0, \pi)$. Here, $N_\phi$ should
be a multiple of 4 to be able to obtain horizontal and vertical EPIs comparable
to the Cartesian sampling, i.e., $\mathrm{EPI}(r, s)|_{\phi = 0} = \mathrm{EPI}(u, s)$ and
$\mathrm{EPI}(r, t)|_{\phi = \pi/2} = \mathrm{EPI}(v, t)$. While the advantage of a polar parametrization
is a more efficient sampling, there are also disadvantages. When
sampling the angle and the radius in equidistant steps, the effective pixel 
size grows with increasing radius, see figure 6.2(b). As a result, fewer 
rays pass through smaller pixels, which would result in a lower signal-to- 
noise ratio for these pixels during the interpolation to the discrete light 
field. 

A possible solution here is to define the radius sampling in such a 
way that each pixel has the same area. This can easily be achieved by not 
using a linear sampling of the radius coordinates but by transforming 
their domain, see figure 6.2(c). The center-most pixel with radius $R_0$ has
the area $A_0$, whereas the area $A_n$ of the remaining pixels is represented
by a sector of an annulus:

$$A_0 = \pi R_0^2, \tag{6.5}$$

$$A_n = \pi \left(R_n^2 - R_{n-1}^2\right) \cdot \frac{1}{N_\phi} \quad \text{for } n > 0. \tag{6.6}$$

By requiring $A_n = A_{n-1} = \dots = A_0$ and using mathematical induction, a
formula for the radius is obtained:

$$R_n = R_0 \sqrt{n N_\phi + 1}. \tag{6.7}$$
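The equal-area property can be checked numerically. The sketch below evaluates the radii from (6.7) and the pixel areas from (6.5) and (6.6); the function names and the example values for $R_0$ and $N_\phi$ are assumptions chosen for illustration only.

```python
import numpy as np

def equal_area_radii(r0, n_phi, n_rings):
    """Outer radii R_0 ... R_{n_rings} following (6.7)."""
    n = np.arange(n_rings + 1)
    return r0 * np.sqrt(n * n_phi + 1)

def pixel_areas(radii, n_phi):
    """Pixel areas from (6.5) and (6.6): central disc, then annulus sectors."""
    a0 = np.pi * radii[0] ** 2
    an = np.pi * (radii[1:] ** 2 - radii[:-1] ** 2) / n_phi
    return np.concatenate([[a0], an])
```

With these radii, every annulus sector has the same area as the central disc, so the ring width shrinks with increasing radius instead of the pixel area growing.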


The reconstruction of the light field using polar coordinates can then 
be performed in the same way as when Cartesian coordinates are used. 
The only distinction is the different sampling grid in the angular plane, 
for which the Cartesian coordinates $u_i, v_i$ need to be transformed into
polar coordinates $r_i, \phi_i$ using

$$r_i = \operatorname{sign}(v_i) \sqrt{u_i^2 + v_i^2}, \tag{6.8}$$

$$\phi_i = \operatorname{atan2}(v_i, u_i) \bmod \pi. \tag{6.9}$$
A reverse transformation from polar coordinates back to Cartesian coordinates
is achieved by

$$u_i = r_i \cos \phi_i + \frac{N_u - 1}{2}, \tag{6.10}$$

$$v_i = r_i \sin \phi_i + \frac{N_v - 1}{2}. \tag{6.11}$$

The only difference between polar sampling with equidistant radius
spacing and polar sampling with equal pixel area is that the
discrete pixel coordinates are slightly different. However, using a polar
parametrization for conventional light field applications could be
a challenge, since existing algorithms are based on rectangular data. 
In particular, commonly used techniques based on CNNs cannot work 
with this representation without further modification, since the standard 
convolution operators would first have to be replaced by polar ones. 


6.2.2 Weighted Interpolation of Irregular Data 


After the parameters of the light field have been defined, each corre- 
sponding light field pixel can be determined for every ray by finding 
the discrete grid point that is closest to the ray’s light field coordinates. 
Since the rays and the grid are normalized to the same scale, the set of
rays $\mathcal{N}^m_{s,t,u,v}$ that affects a pixel (s, t, u, v) can be found using a rounding
operation to the closest integer $[\cdot]$. As a result, each light field pixel is
only influenced by rays that lie in the corresponding 4D cube

$$\mathcal{N}^m_{s,t,u,v} = \left\{ i : \left\| (s, t, u, v)^\top - (s_i, t_i, u_i, v_i)^\top \right\|_\infty \le \frac{m}{2} \right\}, \tag{6.12}$$

where each individual ray is assigned to the nearest pixel when using
m = 1. When using a polar parametrization of the angular coordinates,
the parameters u, v need to be replaced by $r, \phi$. To allow a ray to influence
more than the nearest pixel, higher-order neighbors can be utilized with
$m > 1$, $m \in \mathbb{N}^+$. The intensity of a discrete pixel can then be calculated
from the intensity values of the corresponding rays as a weighted average:


$$L(s, t, u, v) = \frac{\sum_{i \in \mathcal{N}^m_{s,t,u,v}} w_i(u, v, s, t) \, L(s_i, t_i, u_i, v_i)}{\sum_{i \in \mathcal{N}^m_{s,t,u,v}} w_i(u, v, s, t)}. \tag{6.13}$$


For the weighting factor, the distance between a ray’s light field parame- 
ters and its correspondence in the grid is calculated. In order to consider 
larger deviations less, the error is squared and exponentially weighted: 


$$w_i(u, v, s, t) = \frac{1}{\varepsilon_i} \exp\left( -\left\| (s, t, u, v)^\top - (s_i, t_i, u_i, v_i)^\top \right\|_2^2 \right). \tag{6.14}$$

A separate weighting of the individual light field coordinates is not re- 
quired because these have already been brought to a unified basis by the 
normalization (6.4). To additionally benefit from the results of the generic 
camera calibration, an error measure $\varepsilon_i$ is taken into account, e.g., the
pixel-wise ray re-projection error (5.34). This suppresses poorly calibrated
camera rays, which often do not have good optical properties, e.g., dead pixels
or pixels at the edges of microlenses, which can be strongly distorted.

Regarding computational resources, the direct calculation of the set of
nearest neighbors $\mathcal{N}^m_{s,t,u,v}$ is at first extremely inefficient.
Since for each discrete pixel (s, t, u, v) a complete search over all
irregularly distributed rays $(s_i, t_i, u_i, v_i)$ must be performed, the complexity
is $O(n^2)$, with n being the number of pixels. Using more efficient algorithms,
such as k-d trees, can decrease the complexity to $O(n \log n)$ [39].
Fortunately, however, due to the ray coordinates being normalized to a
convenient range, it is even better to simply assign each irregular coordinate
$(s_i, t_i, u_i, v_i)$ to a discrete pixel directly. This is much faster since the
nearest neighbor of a continuous coordinate is directly its closest integer
analog. Hence, a rounding of the ray coordinates directly returns the
corresponding set of nearest neighbors. In addition, to allow the assignment
of higher-order neighbors, a formula can be given that allocates
the rays to a set $\mathcal{N}^m_{s,t,u,v}$ using only fast and simple operations:
ei
%] 


u T sign ® _ a 
a vi m-1; „m | sign (v; — [v; u 
s| [s;] = 2 en sign (s; — [s;|) > i EN tuv 
t [t] sign (t; — [t;]) 

(6.15) 


Hence, a complexity of O(n) is achieved. Even more, due to all rays 
being independent of one another, the creation of the nearest neighbor 
set (6.12) and the weighted interpolation (6.13) can be parallelized using 
GPU hardware. The reconstruction of a complete light field then takes 
only a few seconds (in the case of the Lytro Illum with a 40 Mpx sensor,
using an Nvidia GTX 1080 Ti, an Intel Core i7-6700, and 16 GB RAM). 
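The nearest-pixel case (m = 1) of the weighted interpolation can be sketched as follows. This is a CPU illustration in NumPy, not the GPU implementation mentioned above: each ray is rounded to its grid point, weighted by the exponential of its squared distance, and accumulated in a single pass. The function name is hypothetical, and dividing the weight by a per-ray error `eps` is an assumed form of the error weighting.

```python
import numpy as np

def reconstruct_light_field(coords, intensity, shape, eps=None):
    """Nearest-pixel (m = 1) weighted interpolation onto a regular 4D grid.

    coords:    (n, 4) normalized ray coordinates (s_i, t_i, u_i, v_i)
    intensity: (n,) observed ray intensities
    shape:     (N_s, N_t, N_u, N_v)
    eps:       optional (n,) per-ray calibration errors (assumed weighting 1/eps)
    """
    dims = tuple(int(k) for k in shape)
    nearest = np.rint(coords).astype(int)              # rounding = nearest pixel
    inside = np.all((nearest >= 0) & (nearest < np.array(dims)), axis=1)
    nearest, coords, intensity = nearest[inside], coords[inside], intensity[inside]

    # Squared distance to the grid point, exponentially weighted.
    w = np.exp(-np.sum((nearest - coords) ** 2, axis=1))
    if eps is not None:
        w = w / eps[inside]

    # Accumulate numerator and denominator of the weighted average in O(n).
    flat = np.ravel_multi_index(tuple(nearest.T), dims)
    size = int(np.prod(dims))
    num = np.bincount(flat, weights=w * intensity, minlength=size)
    den = np.bincount(flat, weights=w, minlength=size)

    out = np.full(size, np.nan)                        # NaN marks pixels without rays
    hit = den > 0
    out[hit] = num[hit] / den[hit]
    return out.reshape(dims)
```

Pixels that receive no ray are left as NaN here so that they can be handled explicitly afterwards; this is a design choice of the sketch, not something prescribed by the text.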


6.2.3 Intrinsic Camera Parameters 


Apart from the radiometric reconstruction of the light field, the geometric 
ray properties are relevant in many applications. For optical metrology, 
3D reconstruction, or other areas of computer vision, a mapping is needed 
to transform pixel coordinates into world coordinates and, vice versa, to 
project points from world coordinates onto the pixel plane. Hence, to use 
the light field camera for the deflectometric reconstruction of specular 
surfaces, as will be presented in Ch. 7, the intrinsic camera parameters
need to be available. Unlike the classic pinhole camera model, where 
each world point is mapped to only a 2D pixel pair, the same world point 
can be mapped to more than one 4D light field pixel. Illustratively, this 
can be understood by the observation that a light field camera can also 
be interpreted as an array of individual virtual sub-cameras, where an 
observed point is mapped to a 2D pixel pair in each individual camera’s 
virtual sensor plane. Hence, every angular coordinate needs a projection 
equation from world points to the respective spatial pixels. The perspective
projection of a point $\mathbf{x} = (x, y, z)^\top$ onto spatial pixels through a
pinhole located at the optical center is obtained with (5.1). Following
figure 6.3, projecting the same point through a pinhole located at the
shifted position u then results in

$$s = f \, \frac{x - u}{z} + u, \tag{6.16}$$

$$t = f \, \frac{y - v}{z} + v. \tag{6.17}$$

In conclusion, the intrinsic camera parameters required for the perspective
projection of a point x onto the spatial pixels $(s, t)^\top$ are described for
each angular coordinate u, v by a projection matrix (comparable to the
standard pinhole camera model from Sec. 5.1). Since the optical centers
of the individual sub-cameras are slightly displaced relative to each other in the
u, v-plane, a corresponding translation vector is required to represent the
relative offset to the central sub-camera. The projection is represented by

$$z \begin{pmatrix} s \\ t \\ 1 \end{pmatrix} = \begin{pmatrix} f & 0 & u \\ 0 & f & v \\ 0 & 0 & 1 \end{pmatrix} \left( \mathbf{x} + \begin{pmatrix} -u \\ -v \\ 0 \end{pmatrix} \right). \tag{6.18}$$

For every SAI, the pinhole is shifted in u, v-direction, and the respective 
center of the sensor is shifted in the opposite direction. Since all SAIs 
share the same virtual sensor plane, such parametrization results in an 
interesting effect that shows up in many light field camera calibration 
algorithms: negative disparities can be obtained. For camera array-based 
light field cameras, the minimal disparity is usually zero and corresponds 
to a point at optical infinity. In contrast, the plane of zero disparity in the 
configuration presented here is the “focal plane” of the light field, where
z = f. Points closer to the camera have a positive and points farther 
away a negative disparity. Hence, a point at the focal plane is imaged 
to the same spatial coordinate s = x in every sub-aperture image. Of 
course, the choice of the focal length f influences the light field represen- 
tation. However, the actual value is not important, since it only results 
in a different light field parametrization. Also, the “focal plane” of the 
light field should not be confused with the plane where the imaging has 
the highest sharpness. Theoretically, a light field camera has an infinite 
depth of field. Practically, however, due to the SAIs of a real light field
camera being imaged through only a very small aperture, their depth of 
field is finite but sufficiently high [60]. In summary, the value of f does 
not correspond to a conventional focus. 

Because the light field reconstruction is performed with regular grid 
parameters and discrete pixels (s,t, u, v), the corresponding intrinsic 
parameters need to be derived to project world points to pixel coordinates. 
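Thus, this results for every SAI in a camera matrix and a translation vector

$$\mathbf{K}_{u,v} = \begin{pmatrix} f_s & 0 & c_s(u) \\ 0 & f_t & c_t(v) \\ 0 & 0 & 1 \end{pmatrix}, \qquad \mathbf{t}_{u,v} = \begin{pmatrix} t_s(u) \\ t_t(v) \\ 0 \end{pmatrix}. \tag{6.19}$$

The corresponding parameters can directly be determined from the two-plane
parametrization of the light field by using (6.4) and (6.17):

$$f_s = f \, \frac{N_s - 1}{\Delta s}, \tag{6.20}$$

$$f_t = f \, \frac{N_t - 1}{\Delta t}, \tag{6.21}$$

$$c_s(u) = u \, \frac{\Delta u}{\Delta s} \, \frac{N_s - 1}{N_u - 1} + (u_0 - s_0) \, \frac{N_s - 1}{\Delta s}, \tag{6.22}$$

$$c_t(v) = v \, \frac{\Delta v}{\Delta t} \, \frac{N_t - 1}{N_v - 1} + (v_0 - t_0) \, \frac{N_t - 1}{\Delta t}, \tag{6.23}$$

$$t_s(u) = -u \, \frac{\Delta u}{N_u - 1} - u_0, \tag{6.24}$$

$$t_t(v) = -v \, \frac{\Delta v}{N_v - 1} - v_0. \tag{6.25}$$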
In this context, when using a polar parametrization of the angular coordinates,
the intrinsic parameters can be calculated for each $r, \phi$-pair by
transforming them to their corresponding u, v-value using (6.10) and (6.11).

The forward projection of a point $\mathbf{x} = (x, y, z)^\top$ (measured in the
coordinate system fixed to the central sub-camera) onto the light field
pixel (s, t, u, v) can be found with

$$z \begin{pmatrix} s \\ t \\ 1 \end{pmatrix} = \mathbf{K}_{u,v} \left( \mathbf{x} + \mathbf{t}_{u,v} \right). \tag{6.26}$$

The backward projection of light field pixels (s, t, u, v) to points $\mathbf{x}(z)$
along the associated camera ray is given by

$$\mathbf{x}(z) = z \, \mathbf{K}_{u,v}^{-1} \begin{pmatrix} s \\ t \\ 1 \end{pmatrix} - \mathbf{t}_{u,v}. \tag{6.27}$$

Finally, for every light field coordinate (s, t, u, v), the corresponding ray
in Plücker-coordinates can be obtained easily as

$$\mathbf{d}(s, t, u, v) = \frac{\mathbf{x}(z) - \mathbf{x}(0)}{\left\| \mathbf{x}(z) - \mathbf{x}(0) \right\|}, \tag{6.28}$$

$$\mathbf{m}(s, t, u, v) = \mathbf{x}(0) \times \mathbf{d}(s, t, u, v). \tag{6.29}$$
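The intrinsic parameters and the two projections can be combined into a short sketch. It follows (6.19) to (6.29) under the assumption that the grid parameters (sample counts, minima, and widths) are known from the histogram analysis; all function names and the numbers in the usage note are hypothetical.

```python
import numpy as np

def intrinsics(u, v, f, n, mins, widths):
    """Camera matrix K_{u,v} and translation t_{u,v} from (6.19) to (6.25).

    n      = (N_s, N_t, N_u, N_v)
    mins   = (s_0, t_0, u_0, v_0)
    widths = (Delta s, Delta t, Delta u, Delta v), in physical units
    """
    ns, nt, nu, nv = n
    s0, t0, u0, v0 = mins
    ds, dt, du, dv = widths
    u_phys = u * du / (nu - 1) + u0           # physical pinhole position
    v_phys = v * dv / (nv - 1) + v0
    fs = f * (ns - 1) / ds
    ft = f * (nt - 1) / dt
    cs = (u_phys - s0) * (ns - 1) / ds
    ct = (v_phys - t0) * (nt - 1) / dt
    K = np.array([[fs, 0.0, cs], [0.0, ft, ct], [0.0, 0.0, 1.0]])
    t = np.array([-u_phys, -v_phys, 0.0])
    return K, t

def forward(x, K, t):
    """Forward projection (6.26): world point to spatial pixel (s, t)."""
    h = K @ (x + t)
    return h[:2] / h[2]

def backward(s, t_pix, z, K, t):
    """Backward projection (6.27): point on the camera ray at depth z."""
    return z * np.linalg.solve(K, np.array([s, t_pix, 1.0])) - t

def pluecker(s, t_pix, K, t, z=1.0):
    """Ray direction and moment from (6.28) and (6.29)."""
    x0 = backward(s, t_pix, 0.0, K, t)
    xz = backward(s, t_pix, z, K, t)
    d = (xz - x0) / np.linalg.norm(xz - x0)
    return d, np.cross(x0, d)
```

Projecting a point with `forward` and feeding the result into `backward` at the same depth returns the original point. A point on the plane z = f is mapped to the same spatial pixel for every angular coordinate, which is the zero-disparity property discussed above.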


When the light field camera is used for depth estimation, the disparity 
of a scene feature is estimated [218]. The disparity appears as the slope 
of a line in the EPIs, as detailed in Sec. 3.2. Because negative disparities 
may also be observed with the light field parametrization presented here, 
the SAIs must first be brought to a uniform basis, i.e., a disparity offset 
must be subtracted. Using the disparity and with the help of the baseline 
between the SAIs, it is then possible to convert back to the metric depth: 


$$z = \frac{f_s \, b_s}{d_s - d_{\mathrm{offset},s}} = \frac{f_t \, b_t}{d_t - d_{\mathrm{offset},t}}, \tag{6.30}$$

where $d_s, d_t$ represent the disparity estimated from the horizontal and
vertical EPI, $b_s$ and $b_t$ represent the baselines in the respective directions,
and $d_{\mathrm{offset},s}$ and $d_{\mathrm{offset},t}$ are the offsets of the disparity. They can be
calculated from the intrinsic parameters:

$$b_s = t_s(0) - t_s(1), \qquad d_{\mathrm{offset},s} = c_s(0) - c_s(1), \tag{6.31}$$

$$b_t = t_t(0) - t_t(1), \qquad d_{\mathrm{offset},t} = c_t(0) - c_t(1). \tag{6.32}$$

If, in addition, the light field resolution is chosen to result in square-shaped
spatial and angular pixels, $\Delta s (N_t - 1) = \Delta t (N_s - 1)$ and
$\Delta v (N_u - 1) = \Delta u (N_v - 1)$, then it follows that $b_s = b_t$ and
$d_{\mathrm{offset},s} = d_{\mathrm{offset},t}$, which
means that the disparities calculated from different EPIs should be equal.
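The conversion from disparity to metric depth can be illustrated for the horizontal direction. The sketch assumes that the disparity is measured in pixels between neighboring SAIs as the spatial position at u = 0 minus the position at u = 1, which matches the sign convention of (6.31); the function names and example values are hypothetical.

```python
def principal_point(u, n_s, n_u, width_s, width_u, s0, u0):
    """Principal point c_s(u) from (6.22), in pixels."""
    return (u * width_u / (n_u - 1) + u0 - s0) * (n_s - 1) / width_s

def translation(u, n_u, width_u, u0):
    """Translation t_s(u) from (6.24), in physical units."""
    return -u * width_u / (n_u - 1) - u0

def depth_from_disparity(d, f, n_s, n_u, width_s, width_u, s0, u0):
    """Metric depth z from a horizontal disparity d, following (6.30) and (6.31)."""
    f_s = f * (n_s - 1) / width_s
    baseline = translation(0, n_u, width_u, u0) - translation(1, n_u, width_u, u0)
    d_offset = (principal_point(0, n_s, n_u, width_s, width_u, s0, u0)
                - principal_point(1, n_s, n_u, width_s, width_u, s0, u0))
    return f_s * baseline / (d - d_offset)
```

A disparity of zero yields z = f, the "focal plane" of the light field; positive disparities give smaller depths and negative disparities larger ones.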


6.3 Evaluation 


This section evaluates the presented light field reconstruction algorithm 
and compares it to the state of the art. To show the advantage of the pre- 
sented generic approach, three different light field camera systems are 
evaluated: a Lytro Illum with an RGB sensor, a monochromatic Raytrix 
R5, and a prototype K|Lens lens mounted onto an RGB camera sensor. 
The Lytro Illum and the Raytrix R5 are both microlens-based light field 
cameras. The former is an unfocused plenoptic camera [137], and the 
latter is a focused plenoptic camera [123], see Sec. 3.2. The K|Lens cam- 
era is based on an “Image Multiplier”, which contains a mirror tunnel, 
similar to a kaleidoscope. Using this, a multi-view capture of the scene 
is directly generated and mapped onto the camera sensor [128]. Conse- 
quently, all three cameras are based on very different camera models, and 
for conventional camera calibration, all would need a different calibration 
procedure. However, a generic calibration works independently of the 
camera. Here, the ray geometry of the vision rays of each camera was esti- 
mated using the generic camera calibration from Sec. 5.2.1. Subsequently, 
test scenes were captured to be used as a basis for the comparison of the 
proposed light field reconstruction. 

In the following, first, the presented algorithm is analyzed, and a quali- 
tative evaluation of the generic light field reconstruction for all light field 
cameras is conducted. Then, the quality of the geometric reconstruction 
is investigated by a quantitative comparison of the calibration error. 
Figure 6.4 Lytro Illum data. Top: Sensor image. Bottom: Detailed views. (a) maxzoom setting. (b) minzoom setting.


6.3.1 Light Field Reconstruction: Lytro Illum 


The Lytro Illum light field camera has a sensor of size 7728 × 5368 px
with a pixel pitch of 1.4 µm overlaid with a Bayer pattern. Hence, with
the help of demosaicing, color information can be obtained. In front of
the sensor is an array of hexagonally arranged microlenses, with each
microlens having an approximate diameter of 20 µm and a focal length
of f = 40 µm. Since the camera is an unfocused plenoptic camera, the
distance of the microlenses to the sensor plane corresponds to their focal 
length. The main lens of the camera is a zoom lens with a selectable 
focal length equivalent in the range of 30 mm to 250 mm. Therefore, two 
configurations are investigated for the Lytro Illum camera: A maxzoom 
setting with a focal length equivalent of 250 mm, and a minzoom setting 
with a focal length equivalent of 30 mm. 

Figure 6.4 shows the sensor data corresponding to both zoom settings 
after demosaicing. From a coarse point of view, the images look like
images taken with a conventional camera. Only on closer inspection can the
microlenses be seen. The f-number matching does
not work perfectly for the maxzoom setting since the microlenses show
strong vignetting effects here. For the minzoom setting, these effects also
occur, but not quite as strongly. To compensate for vignetting, the images
are divided by a so-called white image, i.e., an image of a white scene
taken with the aid of an optical diffuser. The pre-processed raw data can
then be used in the light field reconstruction algorithms. For a comparison
of the presented generic method to the state of the art, the light field
reconstruction methods of Dansereau et al. [44] and Bok et al. [20] are
evaluated as well. Both methods only work with unfocused plenoptic
cameras and can thus only be tested on the Lytro Illum data sets.
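The white-image division described above can be sketched as follows. Normalizing the white image to a maximum of one and clamping it with a small `floor` value to avoid division by very dark pixels are assumptions of this sketch, not steps stated in the text.

```python
import numpy as np

def devignette(raw, white, floor=1e-3):
    """Compensate vignetting by dividing a raw image by a white image.

    Assumptions: the white image is normalized to a maximum of 1, and
    values below `floor` are clamped to keep the division stable.
    """
    white = white / white.max()
    return raw / np.maximum(white, floor)
```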


6.3.1.1 Ray distribution and Grid Parameters 


With a pixel pitch of 1.4 µm and a microlens diameter of 20 µm, there
are about 14.3 × 14.3 pixels underneath each microlens. Since the Lytro
Illum is of unfocused design, this corresponds directly to the angular
resolution. A discrete angular sampling can then be found by rounding
up or down. To obtain a central SAI, the angular resolution should
be an odd number. Following Sec. 6.2.1, the spatial resolution can be
found by dividing the sensor size by this angular sampling factor,
resulting in a spatial resolution of approximately 520 × 376 px for each
SAI. Still, to allow for a meaningful discussion of the proposed light
field reconstruction relative to other methods in the literature, the Lytro
Illum data is evaluated by choosing the resolution of the light field grid
to be $(N_s, N_t, N_u, N_v) = (625, 434, 15, 15)$, which is the same as the
reconstructed light field of Dansereau et al. In comparison, the light field
obtained from the reconstruction method by Bok et al. has a resolution
of $(N_s, N_t, N_u, N_v) = (552, 383, 13, 13)$, meaning that the worst pixels at
the edges of the microlenses are cut off.

Now that the resolution of the discrete target light field is known, the 
real light field parameters must be transformed into this newly defined 
4D grid. For this purpose, the intersections of the vision rays with the two 
planes of the two-plane parametrization of the light field are analyzed. 
The histograms of the intersection points for the minzoom and
maxzoom settings are shown in figure 6.5.

Figure 6.5 Lytro Illum: Histogram of ray-plane intersections. (a) maxzoom setting. Left: s, t-plane. Right: u, v-plane. (b) minzoom setting. Left: s, t-plane. Right: u, v-plane.

For the maxzoom setting, the
histograms are very regular. The u, v-plane shows a circular distribution.
Since the u, v-plane is placed at the point of the highest ray density, one 
can indirectly observe the aperture of the main lens here. The diameter 
of the aperture is estimated to be 3.4 cm, which roughly matches
what can be measured with a tape measure when
looking into the objective from outside the camera. The s, t-plane shows 
a rectangle, which corresponds to a projection of the rectangular sensor. 
The extension of this rectangle depends on the arbitrarily chosen distance 
f between the two planes and is therefore not important. The histograms 
of the minzoom setting show strong optical distortions. The s, t-plane 
shows a rectangle, which has a pincushion distortion. This precisely cor- 
responds to the distortions produced by the non-ideality of the main lens. 
Since the generic camera model works completely independently of any
low-dimensional parametric model, this optical distortion is perfectly
described by the generic set of vision rays. Interestingly, the u, v-plane 
is no longer circular but has a hexagonal structure. The aperture of the 
Lytro is therefore most likely hexagonal, meaning the projection of the 
aperture can be seen here. For the maxzoom setting, where the aperture is 
already very small, as seen in the white images, the aperture is probably 
“completely open” and therefore circular. To keep the f-number matching, 
the Lytro Illum seems to have a variable input aperture. 
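
The ray-plane intersections that underlie these histograms can be sketched as follows. This is a minimal Python illustration, assuming that each vision ray is given by an origin and a direction and that both planes are perpendicular to the z-axis at z = z_uv and z = z_uv + f; the function name `two_plane_coordinates` and this coordinate convention are assumptions and not taken from the calibration itself.

```python
import numpy as np


def two_plane_coordinates(origins, directions, z_uv=0.0, f=1.0):
    """Intersect rays with the planes z = z_uv (u, v) and z = z_uv + f (s, t).

    origins, directions: arrays of shape (N, 3); every direction needs a
    nonzero z-component. The plane placement along z is an assumption.
    """
    origins = np.asarray(origins, dtype=float)
    directions = np.asarray(directions, dtype=float)
    # Ray parameter at which each ray reaches the respective plane.
    lam_uv = (z_uv - origins[:, 2]) / directions[:, 2]
    lam_st = (z_uv + f - origins[:, 2]) / directions[:, 2]
    uv = origins[:, :2] + lam_uv[:, None] * directions[:, :2]
    st = origins[:, :2] + lam_st[:, None] * directions[:, :2]
    return uv, st


# One example ray starting at z = -1 and tilted slightly in x.
uv, st = two_plane_coordinates([[0.0, 0.0, -1.0]], [[0.1, 0.0, 1.0]])
# A 2D histogram of the intersections, as visualized in the figures.
hist_uv, _, _ = np.histogram2d(uv[:, 0], uv[:, 1], bins=8)
```

Changing f only rescales the s, t-coordinates, which is why the extension of the rectangle in the s, t-histogram carries no information.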

With the help of the histograms, the dimension of the grid parameters 
can be determined. For this purpose, rectangles are fitted around the 
histograms such that at least 99% of the ray-plane intersections lie
within the rectangular area. This effectively suppresses outliers. Rays
that fall outside the grid are not necessarily lost completely and
can still influence the neighboring light field pixels, as long
as the number of nearest neighbors in (6.12) is chosen to be high enough.
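
One simple way to obtain such a rectangle is sketched below in Python. How the rectangle is fitted is not specified here, so the symmetric per-axis quantiles are an assumption: cutting a quarter of the allowed outlier fraction from each tail of each axis discards at most about 1% of the points in total.

```python
import numpy as np


def fit_rectangle(points, coverage=0.99):
    """Axis-aligned rectangle that keeps roughly `coverage` of the points or more.

    Removing (1 - coverage) / 4 from each tail of each axis discards at most
    about (1 - coverage) of all points in total.
    """
    points = np.asarray(points, dtype=float)
    tail = (1.0 - coverage) / 4.0
    lower = np.quantile(points, tail, axis=0)
    upper = np.quantile(points, 1.0 - tail, axis=0)
    return lower, upper


# Synthetic intersection points standing in for real ray-plane intersections.
rng = np.random.default_rng(0)
pts = rng.normal(size=(10000, 2))
lower, upper = fit_rectangle(pts)
inside = np.all((pts >= lower) & (pts <= upper), axis=1)
print(inside.mean())
```

The bounds `lower` and `upper` then define the extent of the grid along the two axes of the respective plane.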


6.3.1.2 Qualitative Evaluation of Subaperture Images 


Because the sampling grid is chosen relatively freely, some of the discrete
4D pixels may have no corresponding ray. If the interpolation order is
too low, this can lead to a perforated reconstruction. Hence, for the
generic reconstruction, up to second-order nearest neighbors were used
for the angular domain by setting m = 2,
and up to third-order nearest neighbors were used for the spatial domain
with m = 3 in (6.12). Increasing the order of interpolation too much does 
not change the result of the reconstruction significantly, because the expo- 
nential weight of (6.14) automatically punishes rays that are too far away 
very strongly. The only major disadvantage of a higher interpolation 
order is the longer reconstruction time, since the intensity of each ray 
must be considered for more than just the nearest neighbor. 
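
The effect of the neighborhood order and of the exponential weighting can be illustrated with a small Python sketch. The exact form of (6.12) to (6.14) is not reproduced in this section, so the Gaussian-shaped weight, the parameter `sigma`, and the interpretation of m as a window of m grid steps are assumptions made purely for illustration.

```python
import numpy as np


def interpolate_pixel(ray_coords, ray_values, grid_coord, m=2, sigma=0.5):
    """Weighted mean of the ray intensities around one grid coordinate.

    ray_coords: (N, D) continuous ray coordinates in grid units.
    Only rays within m grid steps (Chebyshev distance) are considered;
    NaN is returned if no ray falls into that neighborhood (a "hole").
    """
    ray_coords = np.asarray(ray_coords, dtype=float)
    ray_values = np.asarray(ray_values, dtype=float)
    offset = ray_coords - np.asarray(grid_coord, dtype=float)
    near = np.max(np.abs(offset), axis=1) <= m
    if not np.any(near):
        return np.nan
    dist2 = np.sum(offset[near] ** 2, axis=1)
    weights = np.exp(-dist2 / (2.0 * sigma**2))  # assumed weight shape
    return float(np.sum(weights * ray_values[near]) / np.sum(weights))


# A ray exactly on the grid point dominates a ray one grid step away.
value = interpolate_pixel([[0, 0], [1, 0]], [1.0, 3.0], (0, 0))
print(value)
```

Because the weights decay quickly with distance, enlarging m mainly adds rays with negligible influence, which matches the observation that a higher order costs time but hardly changes the result.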

The reconstruction of the central SAI of the maxzoom dataset captured 
with the Lytro Illum is shown in figure 6.6. Here, only rays from the center 
of the u, v-plane were used in the reconstruction. The presented generic
method reconstructs the scene correctly, although no assumptions were
made about the internal optical structure of the camera and no information
on the correlations between rays and pixels on the sensor was used. In
detail, it can be seen that the generic
(b) Dansereau et al. 


(c) Proposed generic light field reconstruction. 


Figure 6.6 maxzoom setting. Left: SAI from the center of the u, v-plane. Right: Details. 
(a) Bok et al. 


(b) Dansereau et al. 


(c) Proposed generic light field reconstruction. 


Figure 6.7 maxzoom setting. Left: SAI from the edge of the u, v-plane. Right: Details. 
method can reconstruct the light field even near object edges very well. 
The reconstruction results of Dansereau et al. and of the generic method 
are relatively similar and show a slightly sharper result compared to the 
method of Bok et al. However, moving away from the center and looking
at peripheral SAIs that still contain information, one sees that the quality
of the images for Dansereau et al. and Bok et al. decreases significantly, 
while the result of the generic method only becomes slightly blurrier, see 
figure 6.7. In addition, the image of Bok et al. shows black borders, i.e., 
invalidated pixels, at the top and on the right. The generic method shows 
a similar effect, which, depending on how tight the dimension of the 
s, t-plane is chosen using the histogram, could also be stronger. These 
pixels define areas of the light field where there is no real ray. Therefore, 
no information can be obtained. Bok et al. avoid this problem in the
lower-left area by simply reducing the size of the image. For Dansereau
et al., a similar effect shows up if one chooses SAIs that lie even further
toward the edge. However, since their reconstruction is of such poor
quality there, it does not make sense to use it for the comparison made here.

The reconstruction of the central SAI of the minzoom dataset is shown 
in figure 6.8. Here one can see similar results to before. Dansereau et al.’s 
method shows the sharpest reconstruction followed by Bok et al.’s method. 
At the edge of the image, the generic reconstruction shows a similar
performance to Bok et al., visible in the bottom detail image. The minimally
blurrier appearance of the generic reconstruction in the center near the 
alarm clock is due to the relatively freely chosen sampling of the light 
field. In order to reconstruct the entire light field, regions at the periph- 
ery of the image were also reconstructed in this case. And because the 
light field was strongly rectified, the area in the center of the image 
shrinks. Consequently, fewer pixels remain for this area. Bok et al. avoid 
this problem by heavily cropping the entire image. Dansereau et al. do 
not have this problem either, as they do not perform rectification and 
undistortion. Their rectification algorithm only works for the older Lytro 
camera, which has a relatively simple optical setup. But it does not yield 
useful results for the newer Lytro Illum, which has a more sophisticated 
lens setup that reduces optical aberrations and that enables a variable 
zoom setting. Ultimately, this means that the light field camera model of
Dansereau et al. is not generalizable, and it does not even seem to be
(a) Bok et al. 


(b) Dansereau et al. 


(c) Proposed generic light field reconstruction. 


Figure 6.8 minzoom setting. Left: SAI from the center of the u, v-plane. Right: Details. 
(a) Bok et al. 


(c) Proposed generic light field reconstruction. 


Figure 6.9 minzoom setting. Left: SAI from the edge of the u, v-plane. Right: Details. 
applicable to all types of unfocused plenoptic cameras. Overall, this means
that lens aberrations are not compensated for, which can be clearly seen 
in the barrel distortion in figure 6.8(b) and results in straight lines being 
bent. Yet again, when moving away from the center and looking at the 
SAIs at the edge of the u, v-plane, the quality of the images deteriorates, 
see figure 6.9. The reconstructed light fields of Dansereau et al. and Bok
et al. become much blurrier, while the quality of the generic method
becomes only slightly worse. Strong vignetting artifacts appear in the 
upper left corner of the image, which (strangely) do not appear in the 
generic reconstruction, even though all methods are provided with the 
same devignetted sensor data. One possible explanation for this is that 
the vignetting increases the calibration error ε of the generic camera
calibration. Rays with a high calibration error are superimposed by rays 
with a lower error, which leads to the compensation of the vignetting 
during the weighted interpolation of (6.13). Further, in the detail views, 
Dansereau et al. and Bok et al. show some pixels that are completely red,
green, or blue, which are presumably dead pixels. These pixels do not 
appear in the generic reconstruction, since they also have a relatively 
high calibration error. So again, these pixels are efficiently suppressed 
by the weighted interpolation, and the missing information is obtained 
from neighboring rays. 


6.3.1.3 Qualitative Evaluation of Epipolar Plane Images 


Regardless of the quality of the reconstructed SAIs, the advantage of the
proposed method becomes apparent in another area. Apart from the 
central view that only incorporates spatial information, the light field 
contains much more, i.e., angular information. If one fixes an angular 
and a spatial coordinate in the 4D light field pointing in the same di- 
rection, e.g., u and s, one gets a 2D slice of the light field, the so-called 
epipolar plane image (EPI), see Sec. 3.2. Lines of different slopes can be 
seen, whose orientation represents the depth of the observed object point. 
Depth estimation in light fields is thus reduced to a simple local orienta- 
tion estimation in these EPIs, whereby the quality of the estimation is 
significantly influenced by the calibration. The higher the quality of the 
lines, the better the result of the depth estimation. 
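
Extracting an EPI from a discrete light field amounts to simple array slicing. The following Python sketch uses a synthetic 4D array; the axis order (s, t, u, v) and the array sizes are assumptions chosen only for illustration.

```python
import numpy as np

# Synthetic 4D light field with assumed axis order (s, t, u, v).
N_s, N_t, N_u, N_v = 64, 48, 9, 9
light_field = np.random.default_rng(0).random((N_s, N_t, N_u, N_v))

# Horizontal EPI: fix the vertical coordinates t and v, keep s and u.
t0, v0 = N_t // 2, N_v // 2
epi_horizontal = light_field[:, t0, :, v0].T  # shape (N_u, N_s)

# Vertical EPI: fix the horizontal coordinates s and u, keep t and v.
s0, u0 = N_s // 2, N_u // 2
epi_vertical = light_field[s0, :, u0, :].T    # shape (N_v, N_t)

print(epi_horizontal.shape, epi_vertical.shape)  # prints (9, 64) (9, 48)
```

In each slice, one angular and one spatial coordinate of the same direction are fixed, so the remaining pair spans the EPI in which the line slopes are evaluated.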
Figure 6.10 EPIs of the maxzoom setting in comparison: From top to bottom and left to 
right in the order Bok et al., Dansereau et al., proposed generic method. 


For the maxzoom setting, figure 6.10 shows examples of horizontal and 
vertical EPIs generated by fixing u or v to its center coordinates and by 
selecting pixel lines for the s (red) or t (green) coordinate, respectively. 
The coordinates are chosen for each reconstructed light field to approx- 
imately be at the same position. The EPIs of Dansereau et al. show strong 
deviations from the ideal epipolar geometry, visible by the curvy epipo- 
lar lines. This is caused by the poor generalizability of the method which 
was developed for the old Lytro camera and works only moderately well 
for the newer Lytro Illum. Also, there are some errors at the top and the 
bottom. These areas correspond to pixels that are located at the boundary 
of the microlenses, where the imaging is more strongly distorted. For the 
EPIs reconstructed using the method of Bok et al. and the generic method, 
it can be seen that the epipolar geometry is reconstructed with higher 
Figure 6.11 EPIs of the minzoom setting in comparison: From top to bottom and left to 
right in the order Bok et al., Dansereau et al., proposed generic method. 


quality, observable by the straight lines. The slope of the epipolar lines is 
different for each reconstruction method, and it depends on the chosen 
parametrization of the light field. For the generic method, the general 
slope direction can be shifted by changing the distance f between the u, v- 
and s, t-plane. This does not change the information in the light field at 
all but only changes the “focal plane” of the light field, cf. Sec. 6.2.3. The 
parametrization of Bok et al. places the focal plane at infinity, hence the 
reconstructed light field can be interpreted as an array of virtual cameras 
with the optical centers of each camera being located at the same spatial 
pixel position. Thus, a point corresponding to zero slope results in zero 
disparity, which then theoretically implies a distance of infinity. As the 
EPIs show, the parametrizations of the method of Dansereau et al. and the
generic method seem to have the focal plane located near the alarm clock.

Example EPIs of the minzoom setting are depicted in figure 6.11. Now,
the general slope direction is very similar and all methods show a good 
reconstruction of the epipolar geometry. Since the minzoom setting has 
less microlens vignetting, the reconstruction seems to work better for 
all methods. Only Dansereau et al.’s method still shows blurring in the 
upper and lower areas, and the reconstruction of Bok et al. shows dis- 
tortions only in the most distant edge coordinates. However, while the 
epipolar geometry is reconstructed very well by all methods, the lens
distortions are compensated only by the generic method and Bok et al.’s
method, resulting in a rectified light field.


6.3.2 Comparison of Angular Sampling Grids 


Another advantage of the proposed method is the free choice of sampling. 
Therefore, a more suitable sampling grid can be used. The polar sampling 
of the u, v-plane presented in Sec. 6.2.1 is better adapted to the data of 
the Lytro Illum light field camera, and can therefore better represent 
the light field. No unnecessary information is sampled and the result 
is more compact, or rather, more information is contained in the same 
amount of data. With the same resolution and thus the same size of 
the reconstructed light field, polar sampling effectively removes less 
information while representing the relevant information more accurately 
than Cartesian sampling. Figure 6.12 shows the comparison, whereby 
the light field is illustrated as an array of SAIs. 

In detail, it is important how the polar sampling is implemented. As al- 
ready described in Sec. 6.2.1, two options for the choice of radial sampling 
are considered. For the first choice, the radius is set in equidistant steps. 
For the second choice, the radius is set such that the pixel areas of all 
pixels of the sampling grid have equal size. This has the advantage that 
the signal-to-noise ratio remains the same for each pixel. Still, a minor 
disadvantage becomes apparent when analyzing the EPIs. Since now the 
step size of the radius is nonlinearly sampled, the lines in the EPIs are 
no longer straight but curved. The conventional light field depth estima- 
tion, which analyzes the slope of the lines, can therefore no longer be 
applied here without further consideration, as it would provide incorrect 
results or would make corresponding corrections necessary, e.g., a local 
rescaling of the estimated slope of the lines. The comparison of the EPIs 
(a) Polar sampling results in a more efficient representation of the data. 
(b) Cartesian sampling reconstructs unnecessary peripheral areas of the u, v-plane. 
(c) Top: Linear polar sampling with equidistant radius spacing. Bottom: Nonlinear polar 
sampling with equal pixel area. 


Figure 6.12 Comparison of Cartesian and polar sampling. (a) and (b) The light field as an 
array of SAIs. (c) Polar EPIs for φ = 0. 


is shown in figure 6.12(c). In conclusion, it is therefore recommended 
to use polar sampling with equal pixel area if the light field camera is 
only used as a multi-view camera array. For use in the field of depth 
estimation, where the slope of the epipolar lines is analyzed, sampling 
the radius in equidistant steps is preferable. 
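
The two radial sampling options can be compared with a short Python sketch. It assumes a polar grid with equal angular steps, so that equal ring areas imply equal pixel areas; the function name `radial_edges` and the normalization to a maximum radius `r_max` are hypothetical.

```python
import numpy as np


def radial_edges(r_max, n_rings, equal_area=False):
    """Ring boundaries of a polar grid with n_rings radial steps."""
    k = np.arange(n_rings + 1)
    if equal_area:
        # The ring area is proportional to r_k^2 - r_(k-1)^2; keeping it
        # constant requires r_k to grow with the square root of k.
        return r_max * np.sqrt(k / n_rings)
    return r_max * k / n_rings


edges_lin = radial_edges(1.0, 7)
edges_area = radial_edges(1.0, 7, equal_area=True)
ring_areas = np.pi * np.diff(edges_area**2)
print(np.round(np.diff(edges_area), 3))
```

The printed radial step sizes shrink toward the outer rings, which is the nonlinear sampling that bends the lines in the EPIs.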


6.3.3 Super-Resolution through Implicit Ray Interpolation 


An interesting continuation of the generic light field reconstruction ap- 
proach is the possibility to customize the dimension of the discrete pixel 
grid. This allows, for example, a light field super-resolution approach to 
be implemented in a very simple way. That is, the spatial resolution, the 
angular resolution, or both can be artificially increased. Of course, the 
(a) Original. (b) Bilinear interpolation. (c) Super-resolution. 


(d) Original. (e) Bilinear interpolation. (f) Super-resolution. 


Figure 6.13 5× super-resolution through implicit ray interpolation. (a)-(c) show details 
from the central SAI of the minzoom setting. (d)-(f) show details of the maxzoom setting. (a) 
and (d) show the original resolution of the light field. (b) and (e) show the result when 
bilinear interpolation is applied to the images. (c) and (f) show the result of the proposed 
generic super-resolution approach. 
resolution cannot be indefinitely increased, since a corresponding ray 
is not always available for each super-resolved discrete coordinate. As 
the light field pixels become smaller with increasing resolution, fewer 
and fewer rays will hit a pixel. As a result, the reconstructed light field 
may contain holes. To fill these, the generic approach can now be used 
directly by considering neighboring rays. That means, in (6.12) m > 1 
must be chosen. 

An example super-resolved reconstruction of the light field using the 
maxzoom and minzoom settings is shown in figure 6.13. Here, 5× super-resolution
was applied spatially, resulting in the increased resolution of
(N_u, N_v, N_s, N_t) = (15, 15, 3125, 2170). To interpolate missing data from
the 4D neighborhood, m = 4 in the angular domain and m = 7 in the spatial
domain are chosen. For comparison, a 5× oversampling using bilinear interpolation
on the SAIs is shown as well. While the bilinear interpolation increases 
the resolution, the result is still very blurry. The generic super-resolution 
approach, on the other hand, demonstrates impressively that the reso- 
lution could be considerably increased. Even small and distant details 
suddenly become clearly visible. The resolution of the images can be
increased this much because of the very high redundancy
contained in light fields. Conventional super-resolution approaches must
first estimate the depth of the scene and can subsequently map the scene 
points onto a virtual sensor [215, 218]. Alternatively, they are based on 
learning-based methods with complex CNN architectures [185, 235]. 
However, they all have in common that they require an already recon- 
structed light field. The advantage of the simple approach presented 
here is that none of this is necessary. Instead, super-resolved SAIs can be 
reconstructed directly from the generic ray bundle. 
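
Why holes appear with increasing resolution can be illustrated with a small Python experiment. The ray positions below are random and purely hypothetical, as are the base grid size and the helper `empty_fraction`; the sketch only demonstrates that a fixed set of rays leaves more and more grid cells empty as the grid is refined.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical ray positions on a normalized s, t-plane in [0, 1) x [0, 1).
rays = rng.random((5000, 2))


def empty_fraction(points, n_s, n_t):
    """Fraction of grid cells that are not hit by any ray."""
    counts, _, _ = np.histogram2d(points[:, 0], points[:, 1],
                                  bins=(n_s, n_t), range=((0, 1), (0, 1)))
    return float(np.mean(counts == 0))


base = (50, 40)  # assumed base grid, not the Lytro Illum grid
for factor in (1, 2, 5):
    frac = empty_fraction(rays, base[0] * factor, base[1] * factor)
    print(factor, round(frac, 3))
```

The empty cells are exactly those that have to be filled from neighboring rays by choosing m > 1.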


6.3.4 Light Field Reconstruction: Raytrix R5 


While the generic method can already reconstruct light fields very well 
from the raw data of the Lytro Illum camera, it also works with other 
light field cameras without any further adaptation. To show this, the 
light field of a Raytrix R5 was reconstructed. 

The Raytrix R5 light field camera has a monochromatic sensor of size 
2048 × 2048 px with a pixel pitch of 5.5 µm. In front of the sensor is an array 
of hexagonally arranged microlenses with about 25 × 25 px underneath 
Figure 6.14 Raytrix R5: Raw sensor data and detailed view. 


each microlens. A 35 mm fixed focal length objective with a hexagonally 
shaped aperture was used. The aperture is chosen such that the f-number 
matching is approximately fulfilled. The aperture cannot be rotated, 
and therefore it is not perfectly aligned with the hexagonal microlens 
grid, resulting in dark areas at the edge of the microlens, see figure 6.14. 
Because this camera is of the focused design, the distance between the 
microlenses and the sensor plane is different from the microlenses’ focal 
length, see Sec. 3.2. In addition, the camera is a multi-focus plenoptic 
camera, which means that there are three types of microlenses, each with 
a different focal length. 

As before, to transform the continuous light field parameters to a 
discrete pixel grid, the intersections of the camera rays with the u, v- 
and s, t-plane are analyzed. Figure 6.15 shows the histograms of the 
intersection points. The s, t-plane is square due to the square sensor, 
and the u, v-plane shows a circular distribution. 

Because the Raytrix camera is a focused plenoptic camera, the number 
of pixels under each microlens no longer corresponds directly to the 
angular resolution. Rather, the microlenses now show micro-images of 
the scene. Each micro-image can therefore be interpreted as a virtual 
camera, where, depending on the position of the microlens, both the 
optical center of the micro-camera is shifted and a different small section 
Figure 6.15 Raytrix R5: Histogram of ray-plane intersections. Left: s, t-plane. Right: u, v-plane. 


of the scene is shown. The pixels below the microlens hence encode 
spatial information, while the microlens position contains both spatial 
and angular information. The angular resolution of the camera must 
therefore be roughly estimated. Because the micro-images in figure 6.14 
show approximately a three-fold redundancy in both the horizontal and 
vertical direction, the angular resolution is chosen to be N_u = N_v = 3. 
The spatial resolution is slightly oversampled and set to N_s = N_t = 1000. 


6.3.4.1 Qualitative Evaluation 


The reconstruction of the central SAI is shown in figure 6.16. One can 
see that the scene is reconstructed correctly and that even details are 
recognizable. Since the Raytrix light field camera is built differently than
the Lytro, not everything in the reconstructed image is in focus. With this
camera, the depth of field and the focus distance are now determined 
by the main lens and the main lens setting. Because the lens used in this 
experiment is not optimally selected for the Raytrix R5, strong vignetting 
effects are visible at the edges of the microlenses, as can be seen in the raw 
data, see figure 6.14. For the Lytro Illum camera, microlens-vignetting 
reduces the quality of the edge SAIs, whereas for the Raytrix the effect 
can theoretically also be seen everywhere in the central image. Very dark 
pixels at the edges of the microlenses cause reconstruction artifacts in the 
image due to a devignetting operation. However, this unwanted effect 
could be resolved by using a more suitable lens with a hexagonal aper- 
ture, rotating the aperture to be aligned with the hexagonal grid, and 
manually adjusting the aperture’s opening to the correct size. This effect 
is particularly strong in the lower-left area. While strong vignetting is 
visible here, an additional effect occurs. Because the camera is a focused 
plenoptic camera, the images under the microlens contain spatial infor- 
mation. The position of each microlens encodes both angular and spatial 
information. This has the consequence that the micro-images overlap to 
different degrees depending on the distance of the observed objects, i.e.,
the degree of spatial redundancy seems to be distance-dependent. For
the very close area at the bottom left, the micro-images do not overlap 
anymore and in addition, the strong vignetting creates perforated areas 
in which the scene cannot be observed completely. The missing informa- 
tion must therefore be interpolated from distant neighboring rays, which 
leads to the noticeable artifacts here. 

A minor disadvantage of the generic light field reconstruction is that
the multi-focus property of the Raytrix camera is not explicitly taken
into account. This leads to blurred pixels being superimposed
with sharp pixels in the reconstruction. Because the generic light field
reconstruction in this work is intended to be completely independent of
the observed scene and because it does not model the focal properties of
the rays, this problem cannot be solved directly. However, one possibility
to avoid this difficulty would be to classify the pixels beforehand and to
assign them to the three categories of microlenses, i.e., to the three focal
lengths. With this, three separate light fields could be reconstructed, one
for each microlens category, where of course each one would only observe a
perforated part of the scene.


6.3.5 Light Field Reconstruction: K|Lens 


Unlike the previous cameras, the K|Lens is not based on microlenses. 
To be more precise, the K|Lens light field camera is a light field objec- 
tive lens that has to be mounted onto any full-sized camera sensor. For 
this experiment, the K|Lens was mounted on an Allied-Vision Prosilica 
GT4907C RGB sensor. The sensor has a resolution of 4864 × 3232 px with 
a pixel pitch of 7.4 µm. Figure 6.17 shows the sensor image of the camera. 
The different views are clearly visible, which are mirrored differently by 
the kaleidoscope effect. Because the objective lens is not perfectly aligned 
with the sensor, the whole image array is slightly rotated. 
Figure 6.17 K|Lens: Sensor image. 


As with the other cameras, the K|Lens was calibrated using the generic 
calibration and then the intersections with the u, v- and s, t-planes were 
calculated. Figure 6.18 shows the histograms. The s, t-plane is rectangular, 
while the u, v-plane consists of 3 × 3 small dots. These dots correspond
to the optical centers of the respective 3 × 3 views. The dots have a faint
butterfly-shaped boundary, which is most likely caused by lens distortions,
so that there is no single center of projection.
The choice of the discrete light field grid is very straightforward for the
K|Lens. The angular dimension is chosen to be N_u = N_v = 3 and the
spatial dimension is given as one third of the sensor resolution with
N_s = 1621, N_t = 1077.
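
The grid dimensions follow directly from the sensor resolution and the number of views, as the following Python lines show. The variable names are hypothetical; the integer division reflects the assumption that each view receives a whole number of pixels.

```python
sensor_px = (4864, 3232)  # sensor resolution from the text
views = (3, 3)            # 3 x 3 views seen in the u, v-histogram

# Spatial resolution per view: integer share of the sensor along each axis.
N_s, N_t = (size // n for size, n in zip(sensor_px, views))
N_u, N_v = views
print((N_s, N_t, N_u, N_v))  # prints (1621, 1077, 3, 3)
```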


6.3.5.1 Qualitative Evaluation 


For the light field reconstruction, only the direct neighbor was considered 
in the angular domain with m = 1, while second-order neighbors were 
considered in the spatial domain with m = 2. The light field reconstruc- 
Figure 6.18 K|Lens: Histogram of ray-plane intersections. Left: s, t-plane. Right: u, v-plane. 
For better visualization, the colormap of the u, v-histogram has a logarithmic scale. 


tion for the K|Lens camera as an array of SAIs is shown in figure 6.19. All 
in all, one can see that the different views of the camera are reconstructed 
very well. Since for the generic calibration, the arrangement of the pixels 
on the sensor is completely irrelevant, and since only the rays outside 
the camera are of importance, the kaleidoscope effect is automatically 
compensated by the generic reconstruction. In addition, due to the nor- 
malization of the generic ray bundle, the slight rotation of the K|Lens 
objective with respect to the sensor is corrected. Looking at the results in 
more detail, see figure 6.20, there are hardly any differences between the 
sensor data and the reconstruction, both in the central view and in the 
SAIs at the edge. 


6.3.6 Camera Intrinsics and Calibration Error 


Apart from the reconstruction of the light field and the qualitative analy- 
sis of the result, an exact characterization of the ray geometry is essential 
in many areas of computer vision, for optical metrology in general, as well 
as for deflectometry in particular. Since the presented method is based
on generic camera calibration, and to make it comparable with that
calibration, the ray re-projection error ε from Sec. 5.5.1 needs to be investigated. This
error corresponds to the distance between a geometric camera ray and an 
observed point on a reference target. To evaluate the error experimentally, 
a commercially available monitor was used as a reference target, whose 
pixels serve as reference coordinates. The monitor was captured from 
Figure 6.19 K|Lens: Reconstructed 3 × 3 light field. 


different poses using the different cameras and camera configurations. 
In each pose, phase-shift features were acquired using the techniques 
from Ch. 4. For all cameras and both settings of the Lytro Illum camera, 
the raw data with the measured phase-shift features were converted to 
light fields using the presented generic reconstruction method. For com- 
parison to the state of the art, the light fields corresponding to both Lytro 
camera settings were additionally reconstructed with the method by 
Bok et al. Further, with the help of the respective camera parameters, the 
camera rays could be determined for each light field. Subsequently, using 
these camera intrinsics and the generic pose estimation from Sec. 5.2.5, 
the 3D coordinates of the feature points were determined, and the ray 
re-projection error as an average value over all rays could be calculated. 
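
The error statistics reported in the following tables can be sketched in Python as below. The precise definition of ε is given in Sec. 5.5.1 and is not reproduced here, so the perpendicular distance between a ray and its 3D feature point is used as an assumption; the function names are hypothetical.

```python
import numpy as np


def ray_point_distances(origins, directions, points):
    """Perpendicular distance between each ray (as a line) and its 3D point."""
    origins = np.asarray(origins, dtype=float)
    directions = np.asarray(directions, dtype=float)
    points = np.asarray(points, dtype=float)
    d = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    diff = points - origins
    # Remove the component along the ray; the remainder is the offset.
    along = np.sum(diff * d, axis=1, keepdims=True)
    return np.linalg.norm(diff - along * d, axis=1)


def error_statistics(distances):
    """Mean and root-mean-square of the per-ray errors."""
    distances = np.asarray(distances, dtype=float)
    return float(np.mean(distances)), float(np.sqrt(np.mean(distances**2)))


# Two rays along the z-axis; the first point is offset by 5 units.
dist = ray_point_distances([[0, 0, 0], [0, 0, 0]],
                           [[0, 0, 1], [0, 0, 2]],
                           [[3, 4, 10], [0, 0, 5]])
mean_err, rmse = error_statistics(dist)
print(mean_err, rmse)
```

Since the RMSE weights large deviations more strongly than the mean, a large gap between the two values points to outliers among the rays.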

The comparison of the different methods applied to the Lytro Illum 
is shown in table 6.1. The method of Dansereau et al. [44] could unfor- 
tunately not be evaluated, as the rectification algorithm and thus the 
determination of the camera parameters only works for the older Lytro 
but does not provide any meaningful results for the newer Lytro Illum. 
(a) Center of sensor image. (b) Center SAI. 
(c) Upper right area of sensor image. (d) Upper right SAI. 


Figure 6.20 K|Lens: (a) and (c) Sensor image. (b) and (d) Generic reconstruction. For 
better visualization the mirroring of the sensor image is corrected in (c). 


As expected, the generic calibration from Sec. 5.2 has the lowest calibra- 
tion error, since each pixel can be calibrated individually and hence with 
high precision. However, this result cannot be compared directly to the 
other methods, since the correlations of the vision rays and the light 
field information are lost or cannot be used directly with this camera 
model. It is therefore only used to represent a lower limit of the calibra- 
tion error. More importantly, the table shows that the presented generic 
light field reconstruction method has a much smaller mean error and 
RMSE than the method of Bok et al., resulting in a better calibration with
fewer outliers. Thus, the ray geometry is estimated much better, although
the qualitative results of the light field reconstruction are very similar
for both methods. This is because the ray calibration of the
presented generic light field reconstruction itself could be carried out 
very precisely, starting from the generic calibration. The nonidealities of 
the optics are accurately included in the generic camera model, and the 




Table 6.1 Comparison of the ray re-projection errors for the Lytro Illum camera. 


Lytro Illum                    minzoom setting      maxzoom setting
                               ε in µm              ε in µm
Method                         Mean      RMSE       Mean      RMSE
Bok et al.                     375.6     758.8      455.2     1117.7
Generic calibration            43.3      91.4       139.8     376.2
Generic LF-reconstruction      97.1      155.0      231.4     431.1


Table 6.2 Comparison of the ray re-projection errors for the Raytrix R5 and K|Lens. 


                               Raytrix R5           K|Lens
                               ε in µm              ε in µm
Method                         Mean      RMSE       Mean      RMSE
Generic calibration            31.1      77.0       73.1      123.1
Generic LF-reconstruction      121.23    166.8      130.3     178.7


generic light field reconstruction only needs to sample the fully rectified 
light field from the resulting generic ray bundle. In contrast, the method 
by Bok et al. fits a low-dimensional camera model with a low-dimensional 
distortion model to the camera data. Deviations from this model cannot 
be taken into account, and therefore the calibration error increases. Even 
though the generic reconstruction is based on the generic calibration, 
the ray re-projection error is slightly worsened by the interpolation and 
rounding operations of Sec. 6.2.2. A direct comparison of both camera set- 
tings reveals that the calibration for the maxzoom setting provides slightly 
inferior results, regardless of the method. Due to the stronger microlens 
vignetting for this setting, as shown in figure 6.4, the peripheral areas of 
the microlenses capture much less light, which increases the uncertainty 
of the calibration features, and thus worsens the calibration. 

Because the software for the Raytrix and the K|Lens is not available as
open source, only the results of the presented generic methods are shown
here. Table 6.2 shows the results of the respective calibrations. Similar 
to before, the generic calibration can be seen here as a lower limit. For 
the K|Lens, it can be seen that during the light field reconstruction the 
Table 6.3 Intrinsic camera parameters for the Lytro Illum’s maxzoom setting.


Bok et al. [20]          Generic LF-reconstruction
f_x = 4295.7             f_x = 4295.3
f_y = 4295.7             f_y = 4295.3
c_x = 278.1              c_x = 273.9
c_y = 192.0              c_y = 194.3
baseline = 2.99 mm       baseline = 3.16 mm


calibration error again worsens only by a small factor. In contrast to 
this, the generic calibration of the Raytrix R5 camera is very accurate. 
However, when reconstructing the light field from this, the error increases 
strongly. This is mainly due to the strong interpolation artifacts that can 
also be observed in the reconstructed SAIs, see figure 6.16. Nevertheless,
the quality of the light field reconstruction of all cameras and all zoom 
settings is still very close. 

For a detailed comparison between the presented method and the 
method of Bok et al., the maxzoom light field was reconstructed at the
same resolution as Bok et al.’s reconstruction. Since the central SAI of
Bok et al. and that of the generic method are now very similar, their intrinsic
parameters should also be comparable, given the same parametrization 
of the light field grid. Table 6.3 shows the camera parameters for the 
central image, as well as the baseline between neighboring SAIs. In both
cases, the parameters are very similar, and the optical center $(c_x, c_y)$ is
estimated to be close to the center of the respective images. Hence, the 
proposed generic light field reconstruction yields reasonable results. The 
distance between the SAIs is similar too, with a slightly larger baseline
for the generic reconstruction. For Bok et al., this results in a camera array
of width 35.88 mm, while for the generic reconstruction it results in a 
width of 37.92 mm. This again means that Bok et al. does not capture the 
outermost regions of the main lens, while the presented generic method 
captures a slightly larger area in the parametrization investigated here. 
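
The two array widths quoted above can be reproduced from the baselines in Table 6.3. The following minimal sketch assumes a regular array with 13 SAIs per angular dimension, i.e., 12 baselines between the outermost views; this number is an inference from the quoted widths and is not stated in this passage, and the function name is purely illustrative.

```python
# Assumption (not stated in the text): 13 SAIs per angular dimension,
# which gives 12 baselines between the two outermost views.
N_SAI = 13


def array_width_mm(baseline_mm, n_sai=N_SAI):
    """Distance between the two outermost SAIs of a regular camera array."""
    return (n_sai - 1) * baseline_mm


width_bok = array_width_mm(2.99)      # baseline of Bok et al. (Table 6.3)
width_generic = array_width_mm(3.16)  # baseline of the generic reconstruction

print(f"Bok et al.: {width_bok:.2f} mm, generic: {width_generic:.2f} mm")
```

Under this assumption, the difference of 0.17 mm per baseline accumulates to roughly 2 mm over the full array, which is the larger sampled area mentioned above.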

Even if the parameters of the two methods are very similar, this does 
not mean that the quality of the calibration must be comparable, since 
the reconstruction of the light field is different. By taking a closer look 
at the reconstruction quality of the Lytro Illum reconstruction, it can be 


Figure 6.21 Lytro Illum maxzoom setting: Ray re-projection error per pixel for all SAIs. 
Top: LF-reconstruction by Bok et al. [20]. Middle: Generic LF-reconstruction with Cartesian 
angular sampling. Bottom: Generic LF-reconstruction with polar angular sampling. 
seen that the errors increase the further the SAIs are from the center.
This effect is particularly strong with the calibration of Bok et al., while it 
is much smaller with the generic reconstruction. Figure 6.21 shows the 
comparison. Bok et al.’s reconstruction shows for the central SAI a very 
high calibration quality with small ray re-projection errors. However, 
the quality decreases strongly towards the outer regions. Only about
the inner 9 × 9 SAIs still have an RMSE value smaller than 300 µm. The
generic reconstruction, on the other hand, shows small errors even up
to the outer regions of the (u, v)-plane. Only at the very outer limits does
the error increase. At the same time, invalid pixels appear. Due to the
Cartesian sampling of the (u, v)-plane, areas outside the main lens are
now also sampled where simply no rays exist. However, these pixels do
not necessarily pose a problem, since they can simply be classified as 
invalid. To prevent such issues from occurring altogether, one can simply 
use a polar parametrization of the angular coordinates. Thereby only 
the relevant areas of the main lens are sampled and invalid pixels are 
avoided while maintaining a comparable calibration quality. Here again, 
only at the most distant radius values, the error slightly increases. 
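
The effect of the angular parametrization on invalid samples can be illustrated with a small sketch. It assumes an idealized circular main lens aperture of radius 1, a 13 × 13 Cartesian grid over the enclosing square, and a simple ring layout for the polar case; these choices and all function names are hypothetical and do not reproduce the polar sampling schemes of this chapter.

```python
import math


def cartesian_samples(n):
    """n x n grid over the square [-1, 1]^2 that encloses the unit aperture."""
    axis = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    return [(u, v) for v in axis for u in axis]


def polar_samples(n_radius, n_angle):
    """Samples on concentric rings; the largest ring has the aperture radius 1."""
    samples = [(0.0, 0.0)]
    for i in range(1, n_radius + 1):
        r = i / n_radius
        for k in range(n_angle):
            phi = 2.0 * math.pi * k / n_angle
            samples.append((r * math.cos(phi), r * math.sin(phi)))
    return samples


def is_valid(u, v, tol=1e-9):
    """A sample is valid if it lies inside the circular aperture."""
    return math.hypot(u, v) <= 1.0 + tol


cart = cartesian_samples(13)
polar = polar_samples(6, 24)
n_invalid_cart = sum(not is_valid(u, v) for u, v in cart)
n_invalid_polar = sum(not is_valid(u, v) for u, v in polar)

print(n_invalid_cart, "of", len(cart), "Cartesian samples lie outside the aperture")
print(n_invalid_polar, "of", len(polar), "polar samples lie outside the aperture")
```

The Cartesian grid necessarily places its corner samples outside the aperture, whereas the ring layout stays inside by construction.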


6.4 Summary 


This chapter presented a method to calibrate any light field camera (e.g., 
microlens-based, mirror-based, camera arrays) without having to model 
any optical properties explicitly. Utilizing a generic calibration, the indi- 
vidual camera rays were precisely calibrated. Since conventional light 
field-related algorithms require regular sampling, the method trans- 
formed the result into an equivalent light field representation and fitted 
a regular 4D grid onto the irregular camera rays. The summation of the 
weighted intensity values of the rays finally led to the interpolation and 
reconstruction of a rectified light field. Apart from the usual Cartesian 
sampling of the angular coordinates, this chapter presented two possibili- 
ties to sample them in polar coordinates. This proved to be advantageous 
since the light field information can now be represented more compactly. 
Besides the pure reconstruction of the light field’s radiometric quantities, 
a derivation of the intrinsic camera parameters was also presented, i.e., 
the geometric quantities. The reconstructed light field can therefore easily
be used in any subsequent application. 

Finally, experiments showed that the proposed method can provide
good reconstructions and rectified light fields. The epipolar geometry 
between the sub-aperture images is preserved and even shows better 
results than the conventional state-of-the-art methods. In addition, an 
analysis of the geometric parameters utilizing the ray re-projection error 
showed that the proposed method has a smaller calibration error than 
the state-of-the-art methods from the literature, and thus, it achieves a 
better calibration. While providing very good results for a classical un- 
focused plenoptic camera, the evaluation demonstrated that the generic 
reconstruction works for many kinds of light field cameras and yields 
a highly accurate calibration. For the K|Lens, the generic light field re- 
construction is perhaps not the best solution, since the camera optics are 
in principle not very complex and the generic camera calibration is quite 
time-consuming due to the necessary acquisition of dense features. There- 
fore, simpler models with conventional distortion models would perhaps 
find a similarly satisfying solution for this specific camera. However, the 
results clearly show that the presented generic light field reconstruction 
achieves very high accuracy for any light field camera system, no matter 
if it is microlens-based, mirror-based, or relies on other techniques. 

In summary, both the information of the observed scene and the geo- 
metric structure of the light field are preserved by adequate rectification 
and calibration. Ultimately, a better reconstruction of the light field
and an improved estimation of the camera’s geometrical properties lead
to better results when used in optical metrology or depth estimation.
The deflectometric registration already enables a visual inspection of
specular objects with the possibility to detect local surface defects or 
to roughly classify shape deviations. However, it is not yet sufficient to 
enable a deflectometric 3D reconstruction. If, in addition, the intrinsic 
calibration of the camera and the monitor as well as the extrinsic calibra- 
tion of the measurement setup are known, a normal field can in principle 
be determined from the deflectometric measurements. However, the 
three-dimensional shape of a specular object cannot be directly deter- 
mined for the time being, even if a calibrated setup is used. As shown 
in Sec. 3.1, a possible surface normal can be calculated for each point 
in the camera’s field of view, so an infinite number of possible surfaces 
could be the cause of the same measurement. Because of this ambiguity, 
it is necessary to use regularization methods that can determine the true 
surface normal and thus lead to an unambiguous solution. The recon- 
struction of the specular surface is usually done in two steps. In the first 
step, the ambiguity of a single deflectometric measurement is resolved 
by considering additional data. The result is an approximate position of 
the surface in terms of points in space and the corresponding normal 
vectors of the surface at these points. Even if a solution for the surface is 
already available through this regularization, its accuracy is typically still 
insufficient for practical applications. Because deflectometry is a
slope-measuring technique, the accuracy of the normal estimate is orders of
magnitude higher than that of the depth measurement. The actual specular surface
reconstruction is therefore performed as a secondary step. Here, the 
low accuracy surface points obtained from the regularization and the 
corresponding high accuracy normal vectors are taken and combined to 
produce a smooth and continuous representation of the surface. 

Since this work deals with light field cameras, Sec. 7.1 describes pro- 
cedures that use the properties of these cameras to enable a regulariza- 
tion of the deflectometric ambiguity. Subsequently, Sec. 7.2 presents an 


Figure 7.1 Ambiguity of the deflectometric normal estimation. Even with a fully calibrated 
system and by knowing the coordinate of the observed reference feature, a potentially valid 
surface normal can be calculated for every point on a camera ray. 


algorithm that fuses the regularization data with the normal estimates to 
obtain a high accuracy surface reconstruction. Finally, Sec. 7.3 evaluates 
the presented methods by using an experimental deflectometry setup to 
reconstruct the shape of different specular objects. 


7.1 Deflectometric Regularization 


As explained before, the deflectometric reconstruction of the normal field 
is ambiguous. Therefore, initially, no unique solution for the specular 
surface can be specified, see figure 7.1. To resolve the ambiguity of the de- 
flectometric measurement, additional regularizing information is needed. 
In principle, it is sufficient to measure only the distance to one point of 
the surface and to reconstruct the surface from the normal field starting 
from this point [11]. However, if more measurements are available, this
can help to reduce the influence of a single uncertain and noisy surface 
point. For this purpose, various procedures were introduced in Sec. 3.1.2, 
all of which require a more or less complex system structure. 
The main focus of this thesis is to efficiently use the special properties
of the light field camera to enable a deflectometric reconstruction
of the surface. In this section, two methods are presented in which the 
light field camera can be directly used to obtain additional information 
about the surface, which has a regularizing effect on the deflectometric 
measurement. 


7.1.1 Light Field Depth-Based Regularization 


Since the light field camera can partially capture the light field of the 
observed scene, it can extract much more information than a standard 
camera. The additional information, in contrast to standard cameras, 
allows changing the perspective on the scene after the exposure, thus 
enabling depth information to be extracted. 

The depth of a diffusely reflecting scene, i.e., the distance of an ob- 
served object point, can be determined by analyzing the light field’s 
geometric structure, i.e., the slope of the epipolar lines in the EPIs (cf. 
Sec. 3.2). The light field camera can therefore be used as a compact pas- 
sive 3D camera, meaning that structured illumination is not required. 
When surveying partially reflecting surfaces, the special properties of 
the light field camera allow finding depth features on the direct surface 
as well as determining the depth of the reflected scene. These indepen- 
dent measurements can be used as an additional source of regularizing 
information for deflectometry. 

In the following, it is demonstrated how the depth estimation of the 
light field camera can be used to solve the ambiguity problem of the 
deflectometric normal reconstruction. 


7.1.1.1 Direct and Indirect Depth Estimation 


The depth estimation of light field cameras provides candidates for
possible surface points and thus makes it possible to resolve the ambiguity
of the deflectometric normal estimation. In practice, two situations
arise, see figure 7.2. First, if the surface of the measurement specimen 
has diffusely reflecting regions, a standard light field-based depth esti- 
mation can be used to directly measure the distance between the camera 
and the surface for each pixel (or camera ray), see Sec. 3.2. The set of 
Figure 7.2 Direct depth estimation detects diffuse features on the surface. Indirect depth
estimation estimates the depth to the reference monitor and calculates the depth to the 
surface using the known measurement setup. 


depth values $z_\mathrm{direct}$ thereby estimates the distance to the surface. With
the intrinsic parameters of the light field camera and the forward projection
model (6.27), this depth can be transformed to a corresponding ray
length, which can then be directly used to regularize the deflectometric
normal estimate:


$s_\mathrm{direct} = \lVert \mathbf{s}(z_\mathrm{direct}) \rVert$ .    (7.1)


If the measurement sample is fully specular, the light field camera is not 
able to directly determine the distance to the surface. The real surface is 
virtually invisible. For plane mirrors, the camera will instead estimate the 
distance to the reflected reference scene. The resulting ray length becomes 


$s_\mathrm{reflect} = \lVert \mathbf{s} \rVert + \lVert \mathbf{s}_r \rVert$ .    (7.2)


Nevertheless, with the help of the knowledge about the calibrated de- 
flectometric measurement setup and with the registration of camera rays 
to monitor pixels, the direct distance to the surface can be calculated 
from the indirect depth measurement. It follows with the deflectometric 
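measurement $\mathbf{p} = \mathbf{s} + \mathbf{s}_r$ and the depth estimate $s_\mathrm{reflect} = \lVert \mathbf{s} \rVert + \lVert \mathbf{s}_r \rVert$:


$\lVert \mathbf{s}_r \rVert^2 = \lVert \mathbf{p} - \mathbf{s} \rVert^2 = \lVert \mathbf{p} \rVert^2 - 2\, \mathbf{p}^\top \mathbf{s} + \lVert \mathbf{s} \rVert^2$ ,    (7.3)
$\lVert \mathbf{s}_r \rVert^2 = (s_\mathrm{reflect} - \lVert \mathbf{s} \rVert)^2 = s_\mathrm{reflect}^2 - 2\, s_\mathrm{reflect} \lVert \mathbf{s} \rVert + \lVert \mathbf{s} \rVert^2$ .    (7.4)

Equating (7.3) and (7.4), and using $\hat{\mathbf{s}} = \mathbf{s} / \lVert \mathbf{s} \rVert$, gives an estimate $s_\mathrm{indirect}$ for
the distance to the surface:

$s_\mathrm{indirect} = \lVert \mathbf{s} \rVert = \frac{1}{2} \, \frac{\lVert \mathbf{p} \rVert^2 - s_\mathrm{reflect}^2}{\mathbf{p}^\top \hat{\mathbf{s}} - s_\mathrm{reflect}}$ .    (7.5)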
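
The closed-form estimate (7.5) can be checked numerically. The sketch below is only an illustration under stated assumptions: the camera center is placed at the origin, the surface point and the monitor point are hypothetical values, and the total path length is computed synthetically instead of being estimated from a light field.

```python
import math


def norm(v):
    """Euclidean length of a 3-vector."""
    return math.sqrt(sum(c * c for c in v))


def indirect_ray_length(p, s_hat, s_reflect):
    """Eq. (7.5): distance to the mirror from the total path length.

    p          monitor point in camera coordinates (camera center at the origin)
    s_hat      unit direction of the camera ray
    s_reflect  estimated path length camera -> mirror -> monitor
    """
    p_dot_s = sum(a * b for a, b in zip(p, s_hat))
    return 0.5 * (norm(p) ** 2 - s_reflect ** 2) / (p_dot_s - s_reflect)


# Synthetic check with hypothetical numbers.
s_true = (0.0, 0.0, 2.0)                      # surface point on the camera ray
p = (1.0, 0.0, 0.5)                           # observed monitor point
s_len = norm(s_true)
s_hat = tuple(c / s_len for c in s_true)
s_reflect = s_len + norm(tuple(a - b for a, b in zip(p, s_true)))

s_indirect = indirect_ray_length(p, s_hat, s_reflect)
print(s_indirect)
```

Because the total path length is never shorter than the direct distance to the monitor point, the denominator only vanishes in the degenerate case where the monitor point lies on the camera ray itself.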


However, it must be mentioned that (7.2) is valid only if the mirror sur- 
face is sufficiently flat or if the observation angle is chosen appropriately. 
Investigations of Criminisi et al. [41] and Swaminathan et al. [195] have 
shown that the measured length appears compressed or stretched in 
contrast to the true length depending on the surface shape and the mea- 
surement configuration. That is, in reality, the estimated depth becomes 


$s_\mathrm{reflect} = \lVert \mathbf{s} \rVert + \alpha^{-1} \lVert \mathbf{s}_r \rVert$ ,  with  $\alpha = 1 + 2\, \lVert \mathbf{s}_r \rVert \, \kappa \cos(\beta)$ ,    (7.6)


where the multiplicative factor $\alpha$ in the depth estimate is affected by
the distance of the reference scene $\lVert \mathbf{s}_r \rVert$, the incidence angle $\beta$ between
camera ray and surface normal, and the curvature of the surface $\kappa$. Here,
the curvature is measured relative to the “direction of motion” of the
camera, which corresponds in a light field camera to the direction of
the used EPI. That is, different EPIs may provide different depth estimates.
Since it is not possible to estimate the values for $\kappa$ and $\beta$ without
further knowledge about the surface and the measurement setup, the 
only solution to this issue is to detect regions of strong curvature and to 
exclude them from being used for regularization. Even though the sur- 
face cannot be reconstructed unambiguously in deflectometry without 
prior regularizing data, indications about the curvature of the surface 
can still be obtained directly from the deflectometric measurement. With 
an increase in local surface curvature, the directional derivatives of the 
registration data increase as well [100]. Hence, a simple second-order 
gradient calculation with subsequent thresholding allows the detection 
of high curvature regions. 
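
A minimal sketch of such a curvature detection is given below. It assumes that the registration data is available as a 2D list holding one monitor coordinate per camera pixel, uses plain central second-order differences along rows and columns, and applies a hand-picked threshold; the function names, the synthetic data, and the threshold value are illustrative assumptions.

```python
def second_difference(values):
    """Central second-order difference; the two border entries are set to 0."""
    d2 = [0.0] * len(values)
    for i in range(1, len(values) - 1):
        d2[i] = values[i - 1] - 2.0 * values[i] + values[i + 1]
    return d2


def high_curvature_mask(registration, threshold):
    """Mark pixels whose second-order difference exceeds the threshold."""
    rows, cols = len(registration), len(registration[0])
    mask = [[False] * cols for _ in range(rows)]
    for r in range(rows):                       # differences along each row
        for c, d in enumerate(second_difference(registration[r])):
            if abs(d) > threshold:
                mask[r][c] = True
    for c in range(cols):                       # differences along each column
        column = [registration[r][c] for r in range(rows)]
        for r, d in enumerate(second_difference(column)):
            if abs(d) > threshold:
                mask[r][c] = True
    return mask


def synthetic_row(cols=20):
    """Linear registration (flat mirror) that turns quadratic from column 10 on."""
    return [2.0 * c + (0.5 * (c - 10) ** 2 if c >= 10 else 0.0) for c in range(cols)]


registration = [synthetic_row() for _ in range(5)]
mask = high_curvature_mask(registration, threshold=0.75)
print("".join("x" if m else "." for m in mask[2]))
```

Pixels flagged in the mask would then be excluded from the indirect depth regularization, while the remaining, nearly planar regions are kept.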

The two-fold depth estimation presented in this section can in princi- 
ple be performed at the same time due to the special properties of the 
light field camera. When light field cameras observe partially reflecting 
or transparent objects, the resulting light field can be interpreted as a 
superposition of two individual light fields. For classical stereo camera 
systems, this is usually troublesome and results in erroneous depth estimates.
For light fields, however, an analysis of the EPIs now shows
a superposition of the line-like structures as well [96]. A simultaneous 
estimate of both orientations thus provides a depth estimate of both the 
partially specular object and the reflected scene. Methods for estimating 
these depths use, e.g., higher-order structure tensors or optical flow [217, 
232]. In practice, however, it became apparent that this simultaneous 
depth estimation is not suitable for deflectometric regularization and 
that a sequential estimation leads to better results since the task is sim- 
plified. When examining surfaces with diffuse components, it is only 
necessary to take an image where the monitor is completely white, and 
thus, the surface is sufficiently well illuminated. Since the reflection (the 
monitor) now contains no structure, the depth estimation algorithm will 
only detect features directly on the surface. To subsequently measure 
the distance to the reflected monitor, it is possible to perform a depth 
estimation directly on the registration data. This means that in this case 
the light field does not contain color information, but each light field pixel 
is assigned the 2D coordinates of the observed monitor pixel estimated 
via phase-shift coding. Using this as a direct image feature is advisable 
because then image noise is drastically reduced, enabling a more robust 
depth estimation. 

In summary, for partially specular surfaces, the light field camera can 
obtain two separate depth estimates. However, most of the classical depth
estimation algorithms (including the ones based on CNNs) only provide 
the depth of the central SAI [98, 187], since it yields the most accurate 
results. Further, many algorithms provide an additional confidence esti- 
mate for the depth [18, 199]. Hence, for any partially specular surface, 
the direct depth estimate $z_\mathrm{direct}$ with confidence $c_\mathrm{direct}$ is obtained. Areas
with high confidence are caused by a structured surface, while low
confidence implies areas with little structure or even fully specular areas.
For planar mirrors, the indirect depth estimate $z_\mathrm{indirect}$ can be obtained
with confidence $c_\mathrm{indirect}$. In contrast to the direct depth estimation, the
confidence is hereby lower for diffusely structured surface areas, while 
fully specular areas have higher confidence. For non-planar mirrors, the 
confidence measure is also affected by the curvature. 
Figure 7.3 Principle of stereo deflectometry: A deflectometric measurement induces two
independent normal fields in the fields of view of the cameras. On the true surface, the 
surface normals measured in both cameras must coincide. 


7.1.2 Light Field Multi-View-Based Regularization 


A shortcoming of the regularization method from the previous section is 
that it only estimates the depth of the central SAI and does not provide 
the depth for the other SAIs. However, the major disadvantage of depth- 
based regularization is that it only works for special surfaces. This means 
that initially it cannot be used to measure fully specular and curved sur- 
faces. To be able to measure such surfaces as well, this section introduces 
a combination of the principle of multi-stereo deflectometry with light 
field cameras to obtain accurate regularization points in each SAI. 

In (multi-)stereo deflectometry, the surface is observed by at least 
one additional camera. In contrast to the classical stereo vision and the 
depth estimation of diffuse surfaces, on fully specular surfaces there is 
the difficulty that no direct point correspondences can be found since 
initially only virtual features are captured in both cameras. That is, pixels 
from the cameras observing the same surface point will see different 
points in the monitor plane. However, specular stereo can be achieved 
by correlating the normal vector fields induced by two measurements, 
where the true surface can be found in the intersection of both solution 
manifolds. Hence, an indirect surface triangulation can be achieved with 
the following: In the field of view of the first camera a three-dimensional 
normal field $\mathbf{n}_1$ is induced by a deflectometric measurement. The second
camera with a different field of view on the test object provides another
normal field $\mathbf{n}_2$. Thus, for each point in the intersection of the fields
of view, two candidates for surface normals can be calculated. On the
real test surface, these normals must coincide, $\mathbf{n}_1 = \mathbf{n}_2$. For points that
are not on the surface, one usually observes a deviation of the normal
directions [10]. Figure 7.3 illustrates this principle.

A very basic algorithm for surface reconstruction is to determine along 
a search direction the points where the two normal directions coincide 
best. These points regularize the deflectometric ambiguity and repre- 
sent possible surface points. The normals determined in this way are 
the corresponding surface normals. The stereo principle can be easily 
extended to a multi-view approach. Since the light field camera can
be interpreted as a multi-camera array, a light field multi-view-based 
regularization can be easily implemented, where surface points can be 
found for each SAI. 


7.1.2.1 Regularization by Normal Disparity Minimization 


To be able to quantitatively evaluate the similarity of the measured surface
normals for each point in space, a suitable distance measure, the so-called
normal disparity, has to be defined. A disparity measure that is widely
used in the literature is the variance of the normal field in the observed
surface point under consideration [10, 21]. This can be obtained by first
calculating the mean of the normal estimates corresponding to every
view, $\mathbf{n}_\mathrm{mean} = \frac{1}{N} \sum_{n=1}^{N} \mathbf{n}_n$, and by subsequently calculating the mean
squared angle between this mean normal and the individual normals:

$J(\mathbf{s}) = \frac{1}{N} \sum_{n=1}^{N} \arccos\left( \hat{\mathbf{n}}_n(\mathbf{s})^\top \hat{\mathbf{n}}_\mathrm{mean}(\mathbf{s}) \right)^2$ ,    (7.7)

where $\hat{\mathbf{n}} = \mathbf{n} / \lVert \mathbf{n} \rVert$ indicates a unit vector, and where all normal estimates
depend on the examined point $\mathbf{s}$.

If additional information about the quality of the individual measure- 
ments is available, it is reasonable to use weighted averages instead of 
plain averages. The quality of the estimation is influenced by two factors: 
the uncertainty of the deflectometric registration, i.e., the phase uncertainty
$\sigma_\varphi$, and the inherent accuracy of the camera calibration, i.e., the
residual calibration error $\varepsilon$. Thus, a weighting factor combining both
factors can be provided for every camera pixel, or in the case of the light
field, it is available for every ray $\hat{\mathbf{s}}_{uv}(s,t)$ of each SAI:

$w_{uv}(s,t) = \frac{1}{\sigma_\varphi^2(u,v,s,t) \, \varepsilon^2(u,v,s,t)}$ .    (7.8)


For the sake of brevity, the dependence on the individual light field pixels 
is omitted in the following, as long as it does not impede understanding. 
Hence, by interpreting the light field as a camera array, for every spatial 
pixel (s,t) , the objective that needs to be minimized becomes 


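$J(\mathbf{s}_{uv}) = \frac{1}{\sum_{\tilde{u},\tilde{v}} w_{\tilde{u}\tilde{v}}} \sum_{\tilde{u},\tilde{v}} w_{\tilde{u}\tilde{v}} \arccos\left( \hat{\mathbf{n}}_{\tilde{u}\tilde{v}}(\mathbf{s}_{uv})^\top \hat{\mathbf{n}}_\mathrm{mean}(\mathbf{s}_{uv}) \right)^2$ .    (7.9)
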
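
The weighted normal disparity can be sketched in a few lines. The version below is an illustration under assumptions: normals are given as plain 3-tuples, the mean normal is taken as the weighted mean of the unit normals (the text does not specify whether the same weights enter the mean), and the function names are hypothetical. With equal weights it reduces to the unweighted measure (7.7).

```python
import math


def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def normal_disparity(normals, weights=None):
    """Weighted mean squared angle between each normal and the mean normal."""
    if weights is None:
        weights = [1.0] * len(normals)          # unweighted case
    units = [normalize(n) for n in normals]
    w_sum = sum(weights)
    # Assumption: the mean normal is the weighted mean of the unit normals.
    mean = normalize(tuple(
        sum(w * u[k] for w, u in zip(weights, units)) / w_sum for k in range(3)))
    total = 0.0
    for w, u in zip(weights, units):
        cos_angle = sum(a * b for a, b in zip(u, mean))
        cos_angle = max(-1.0, min(1.0, cos_angle))   # guard against rounding
        total += w * math.acos(cos_angle) ** 2
    return total / w_sum


on_surface = [(0.0, 0.0, 1.0)] * 4
off_surface = [(0.0, 0.0, 1.0), (math.sin(0.2), 0.0, math.cos(0.2))]
print(normal_disparity(on_surface), normal_disparity(off_surface))
```

Identical normals give a disparity of practically zero, while two normals that enclose an angle of 0.2 rad each deviate by 0.1 rad from their mean, so the disparity grows with the disagreement between the views.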
To find the surface, it is now necessary to search the entire measure- 
ment space for the regions with minimum normal disparity. To avoid 
discretizing the measurement space with unnecessarily high resolution, 
and to prevent a too coarse representation of the surface as well, initially, 
no continuous parametrization of the surface is sought. Instead, the exact 
resolution of the camera is used, and the optimal distance to the surface 
is searched for each camera pixel, i.e., for each ray. As a consequence, 
the minimization of the normal disparity along each ray depends on 
only one parameter: the length of the ray or rather the depth z of the 
corresponding point s(z) . Moreover, each ray can be considered individ- 
ually, which allows the optimization to be performed in parallel. For each 
pixel, respectively for each camera ray, one obtains the one-parametric 
optimization problem 


$z = \underset{z}{\operatorname{argmin}} \; J(\mathbf{s}(z))$ .    (7.10)

To evaluate $J(\mathbf{s})$ and to calculate the disparity, a few intermediate
steps are necessary. First, starting with a single discrete light field pixel
$(u, v, s, t)$, the corresponding point in space $\mathbf{s}_{uv}(z)$ must be determined
according to the currently evaluated depth $z$. For this purpose, the spatial
pixel $(s, t)$ of the current SAI is lifted into space by using the camera
intrinsics from Sec. 6.2.3:

$\mathbf{s}_{uv}(z) = z \cdot \mathbf{K}_{uv}^{-1} \begin{pmatrix} s \\ t \\ 1 \end{pmatrix} - \mathbf{t}_{uv}$ .    (7.11)

Subsequently, with the help of (6.26), the same point is then projected
back onto the virtual sensor planes of all other SAIs with the angular
coordinates $(\tilde{u}, \tilde{v})$:

$\begin{pmatrix} s_{\tilde{u}\tilde{v}} \\ t_{\tilde{u}\tilde{v}} \\ 1 \end{pmatrix} = \frac{1}{z} \, \mathbf{K}_{\tilde{u}\tilde{v}} \left( \mathbf{s}_{uv}(z) + \mathbf{t}_{\tilde{u}\tilde{v}} \right)$ .    (7.12)

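With the help of the projected light field pixel coordinates, the respective
deflectometric measurement can be obtained, consisting of the measured
monitor coordinate transformed to the camera coordinate system
and the value of the respective weighting factor:

$\mathbf{p}_{\tilde{u}\tilde{v}} = \mathbf{R}\, \mathbf{x}(\tilde{u}, \tilde{v}, s_{\tilde{u}\tilde{v}}, t_{\tilde{u}\tilde{v}}) + \mathbf{t}$ ,    (7.13)
$w_{\tilde{u}\tilde{v}} = w(\tilde{u}, \tilde{v}, s_{\tilde{u}\tilde{v}}, t_{\tilde{u}\tilde{v}})$ .    (7.14)

Due to the possibility of non-integer spatial pixels $(s_{\tilde{u}\tilde{v}}, t_{\tilde{u}\tilde{v}})$, intermediate
values are calculated by means of bilinear interpolation. In the final
step, the surface normals are calculated using the surface point under
consideration $\mathbf{s}_{uv}(z)$, the monitor points $\mathbf{p}_{\tilde{u}\tilde{v}}$ measured via phase-shift coding,
and the respective camera rays $\hat{\mathbf{s}}_{\tilde{u}\tilde{v}}$ for all SAIs (including $u, v$):

$\mathbf{n}_{\tilde{u}\tilde{v}}(z) = \frac{\mathbf{p}_{\tilde{u}\tilde{v}} - \mathbf{s}_{uv}(z)}{\lVert \mathbf{p}_{\tilde{u}\tilde{v}} - \mathbf{s}_{uv}(z) \rVert} - \hat{\mathbf{s}}_{\tilde{u}\tilde{v}}$ .    (7.15)
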
Using these steps, the normal disparity (7.9) can be calculated for the 
pixel (u, v, s, t) at the depth z. 
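
The three geometric steps, lifting a pixel, projecting it into another SAI, and forming the normal candidate, can be sketched as follows. This is a simplified illustration, not the implementation of this work: it assumes one pinhole matrix shared by all SAIs (focal length and principal point taken from Table 6.3 purely as example values), pure translations between the SAIs, and hypothetical function names. The monitor lookup and the bilinear interpolation are left out.

```python
import math

# Illustrative pinhole intrinsics shared by all SAIs (values from Table 6.3).
F, CX, CY = 4295.3, 273.9, 194.3


def unit(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def lift(s, t, z, t_uv):
    """Eq. (7.11): lift spatial pixel (s, t) of the SAI with offset t_uv to depth z."""
    ray = ((s - CX) / F, (t - CY) / F, 1.0)          # K^{-1} (s, t, 1)^T
    return tuple(z * r - o for r, o in zip(ray, t_uv))


def project(point, t_other):
    """Eq. (7.12): project a 3D point into the SAI with offset t_other."""
    x, y, z = (c + o for c, o in zip(point, t_other))
    return (F * x / z + CX, F * y / z + CY)


def normal(point, p_monitor, t_other):
    """Eq. (7.15): unnormalized normal candidate seen from the SAI with offset t_other."""
    s_hat = unit(tuple(c + o for c, o in zip(point, t_other)))     # camera ray
    r_hat = unit(tuple(a - b for a, b in zip(p_monitor, point)))   # towards monitor
    return tuple(r - s for r, s in zip(r_hat, s_hat))


t_ref = (0.0, 0.0, 0.0)        # reference SAI
t_nb = (-3.16, 0.0, 0.0)       # neighboring SAI, one baseline away (3.16 mm)
point = lift(300.0, 200.0, 500.0, t_ref)
pixel_ref = project(point, t_ref)
pixel_nb = project(point, t_nb)
print(pixel_ref, pixel_nb)
```

Projecting the lifted point back into its own SAI returns the original pixel, while the neighboring SAI sees it shifted by the usual stereo disparity of focal length times baseline divided by depth.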

The one-parametric optimization problem (7.10) can now be optimized 
along the individual camera rays using a line search algorithm. Since 
the computation of the normal disparity is costly, gradient-free methods 
such as Brent’s method are suitable for this purpose [30]. This method
combines golden-section search with parabolic interpolation, and it
converges in the ideal case at a superlinear rate to the optimum. In each
evaluation step of the optimization, the normal disparity (7.9) must be
calculated for the current depth value z. Due to the independence
of the individual camera pixels, the corresponding depth values can 
be easily optimized in parallel. However, since a fully convex objective 
is required for a correct optimization, a few issues arise regarding the 
disparity minimization. In general, the depth-dependent normal dispar- 
ity has at least two minima. One appears at the surface. Another one 
emerges for $z \to \infty$, because the camera rays become increasingly
parallel to each other with growing distance, so that the calculated surface
normals become increasingly similar. An incorrect initialization could
therefore lead to an erroneous depth estimate [200]. For particular imag- 
ing configurations and concave surfaces, the issue becomes even worse, 
since then the objective may even show multiple minima. More precisely, 
different surfaces can be generated which cannot be distinguished even 
with a stereo approach [221]. To solve these difficulties, prior knowledge 
about the distance to the surface must be used, and the minimization 
must be constrained by boundary conditions. Consequently, the final 
optimization problem is obtained for each pixel, respectively for each 
ray, where the search space of the depth $z$ is constrained by a convenient
choice of bounds $[z_\mathrm{min}, z_\mathrm{max}]$ that avoids incorrect minima:

$z = \underset{z \in [z_\mathrm{min}, z_\mathrm{max}]}{\operatorname{argmin}} \; J(\mathbf{s}(z))$ .    (7.16)


Hence, with the same notation as for the other regularization points, 
the depth map $z_{uv}$ is obtained and can be used to regularize the deflectometric
normal measurement. Alg. 3 summarizes the multi-view
disparity minimization. 
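
The role of the bounds can be illustrated with a bounded one-dimensional search. The sketch below uses plain golden-section search, i.e., only the bracketing part of Brent’s method without the parabolic interpolation step, and a purely hypothetical toy disparity that has its minimum at a depth of 500 and slowly decays towards zero for large depths, mimicking the second minimum at infinity.

```python
import math


def golden_section_minimize(f, lo, hi, tol=1e-6):
    """Minimize a function that is unimodal on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:                 # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:                       # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return 0.5 * (a + b)


def toy_disparity(z):
    """Hypothetical objective: minimum at z = 500, slow decay for z -> infinity."""
    well = 1.0 - math.exp(-((z - 500.0) / 50.0) ** 2)
    return well * 2000.0 / (z + 1000.0)


z_min, z_max = 400.0, 600.0         # prior knowledge about the surface distance
z_opt = golden_section_minimize(toy_disparity, z_min, z_max)
print(round(z_opt, 3))
```

Inside the chosen bounds the toy objective is unimodal, so the search ends at the surface. Without bounds, a search started far behind the surface would follow the decaying tail instead.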


7.2 Surface Reconstruction 


In principle, the regularization points which can be found with the meth- 
ods from the previous sections can be used directly to reconstruct the 
surface, for example, by calculating an average. However, since multi- 
stereo measurement systems such as the light field camera are limited in 
their measurement quality by the width of the effective stereo baseline, 
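
Algorithm 3 Light Field Multi-Stereo Deflectometry
Input: Registration data, camera intrinsics, relative pose
Output: Depth and surface normal with minimal normal disparity
Initialize: Set min and max distance $z_\mathrm{min}$, $z_\mathrm{max}$
for $(u, v, s, t) \in [0, N_u - 1] \times [0, N_v - 1] \times [0, N_s - 1] \times [0, N_t - 1]$ do
    Get first depth value (using Brent’s method)
    $z := z_{uv}(s,t) \leftarrow \mathrm{Brent}(z_\mathrm{min}, z_\mathrm{max})$
    while disparity $J$ is not yet sufficiently small do
        Project ray to world coordinates for depth $z$
        $\mathbf{s}_{uv}(z) = z \cdot \mathbf{K}_{uv}^{-1} (s, t, 1)^\top - \mathbf{t}_{uv}$
        Calculate disparity and surface normal
        for $(\tilde{u}, \tilde{v}) \in [0, N_u - 1] \times [0, N_v - 1]$ do
            Transform to SAI-pixel coordinates
            $(s_{\tilde{u}\tilde{v}}, t_{\tilde{u}\tilde{v}}, 1)^\top = \frac{1}{z} \, \mathbf{K}_{\tilde{u}\tilde{v}} (\mathbf{s}_{uv}(z) + \mathbf{t}_{\tilde{u}\tilde{v}})$
            Get corresponding monitor coordinate and weight factor
            $\mathbf{p}_{\tilde{u}\tilde{v}} = \mathbf{R}\, \mathbf{x}(\tilde{u}, \tilde{v}, s_{\tilde{u}\tilde{v}}, t_{\tilde{u}\tilde{v}}) + \mathbf{t}$
            $w_{\tilde{u}\tilde{v}} = w(\tilde{u}, \tilde{v}, s_{\tilde{u}\tilde{v}}, t_{\tilde{u}\tilde{v}})$
            Calculate normal
            $\mathbf{n}_{\tilde{u}\tilde{v}}(z) = \frac{\mathbf{p}_{\tilde{u}\tilde{v}} - \mathbf{s}_{uv}(z)}{\lVert \mathbf{p}_{\tilde{u}\tilde{v}} - \mathbf{s}_{uv}(z) \rVert} - \hat{\mathbf{s}}_{\tilde{u}\tilde{v}}$
        end for
        $J(\mathbf{s}_{uv}) = \frac{1}{\sum_{\tilde{u},\tilde{v}} w_{\tilde{u}\tilde{v}}} \sum_{\tilde{u},\tilde{v}} w_{\tilde{u}\tilde{v}} \arccos\left( \hat{\mathbf{n}}_{\tilde{u}\tilde{v}}(\mathbf{s}_{uv})^\top \hat{\mathbf{n}}_\mathrm{mean}(\mathbf{s}_{uv}) \right)^2$
        Calculate next depth (using Brent’s method)
        $z_{uv}(s,t) \leftarrow z \leftarrow \mathrm{Brent}(J, z, z_\mathrm{min}, z_\mathrm{max})$
    end while
end for
return $z_{uv}(s,t)$, $\mathbf{n}_{uv}(s,t)$, $J_{uv}(s,t)$
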
one does not always achieve the desired accuracy with this kind of depth-based
regularization. In contrast, deflectometry measures slopes, or
rather surface normals, with precision several orders of magnitude higher
than the depth, but requires information about surface points for regularization
[51]. It is therefore useful not to rely solely on depth estimation. Instead,
the dense deflectometric measurements of the surface normals can
be fused with the various regularization points, which may also be only 
sparsely available. In doing so, an optimal surface is found whose nor- 
mals coincide with the deflectometrically measured ones and which has a 
minimal distance to the calculated regularization points at the same time. 


7.2.1 Surface Reconstruction by Depth and Normal Fusion 


Starting from a single known surface point, the surface can be inte- 
grated from the normal field [100]. However, classical region-growing 
approaches propagate both the measurement and discretization error 
along the integration path [50]. In addition, a major challenge is that 
typically in practical situations the normal field is corrupted by noise and 
is therefore almost never integrable and curl-free. Due to this, variational 
approaches are often used where only the integrable part of the normal 
field is considered and the integration task is formulated as a minimiza- 
tion problem [164]. The general approach of normal field integration 
can be formulated as an optimization problem as follows: Find the set 
of surface points $s \in S$ for which the functional $E : S \to \mathbb{R}$ 

\[ E(s) = \int_S \lVert n - n_m(s) \rVert^2 \, \mathrm{d}\sigma \tag{7.17} \]

with surface element $\mathrm{d}\sigma$ and surface normal $n$ takes a global minimum. 
That said, since in deflectometry the measured normal $n_m(s)$ depends 
on the surface $s$ itself, there exist infinitely many solutions that 
minimize the above functional [9]. To find the true surface from the infinite 
manifold of surfaces, regularization points have to be included in the 
optimization. The surface reconstruction can again be modeled by energy 
minimization: 


\[ \operatorname*{arg\,min}_{s,\,n} \int_S \lVert n - n_m(s) \rVert^2 + \sum_i \lVert s - s_i \rVert^2 \, \mathrm{d}\sigma . \tag{7.18} \]

Thus, the sought surface should have minimal distance to the 
regularization points $s_i$, and at the same time the difference between the 
surface normals and the deflectometrically measured normals $n_m$ should be 
minimized. This resolves the ambiguity of deflectometry and yields an 
overall more robust 3D reconstruction. For practical 
implementation, however, the functional needs to be discretized and adapted 
to the available data. For the light field data available in this work, the 
depth and normal measurements are located pixel-wise on a discrete 
grid and the different perspectives of the light field camera are very close 
to each other. Hence, it is not necessary to search for a general solution 
of the functional (7.18) in an unconstrained 3D space. Instead, the deflec- 
tometric surface reconstruction is formulated here as a discrete gradient 
integration, and the measured surface points are projected onto a depth 
map z(s,t). In addition, the corresponding surface gradient g(z) of this 
depth map is calculated from the depth-dependent normal field n(z). 
Depending on whether perspective or orthographic projection is used, 
different formulas have to be used for this calculation, cf. Sec. 2.4. 

Since (7.17) is an ill-posed problem, its minimization alone would not yield a 
meaningful result. Adding regularization points makes the solution 
unique, but since the coupling between the normals and 
the surface points is rather weak, and since the regularization points 
may be only sparsely available, it makes sense to introduce further 
regularizing assumptions to simplify the optimization [8]. In many areas 
of image processing, Total Variation (TV) is used as a popular regular- 
ization method because it can handle discontinuities in the data while 
smoothing noisy measurements [33]. However, it has the disadvantage 
that linear changes in an intensity profile can form unwanted staircase- 
like structures after optimization. In depth maps, such intensity changes 
correspond to a change in depth, e.g., tilted planes, which are by no 
means uncommon. Therefore, in the field of 3D reconstruction, the TV 
has the serious disadvantage that such surfaces cannot be reconstructed 
correctly. In contrast to TV, Total Generalized Variation (TGV) avoids this 
effect by allowing higher-order solutions [28]. 

Thus, the continuous functional (7.18) is first discretized, and the TGV 
is used as an additional regularization. Similar to the TGV-based 
image fusion of Pock et al. [157] and the normal fusion of Antensteiner et al. 
[8], a discrete optimization problem that enables a surface reconstruction 
through a fusion of depth and normal measurements can be defined as 

\[ \operatorname*{arg\,min}_{z,\,g} \sum_i w_i \lVert z - z_i \rVert^2 + w_m \lVert g - g_m(z) \rVert^2 + \mathrm{TGV}(z, g) . \tag{7.19} \]

Here, $z_i$ are the regularizing depth estimates, $g_m(z)$ 
calculates the gradient for given depth-dependent normal estimates, $w_i$ and 
$w_m$ are weights, and $z$ and $g$ are the sought surface and surface gradient, 
respectively. Further, the TGV term can be expressed using the gradient 
operator $\nabla$ and a symmetrized derivative operator $\mathcal{E} = \tfrac{1}{2}(\nabla + \nabla^T)$: 

\[ \mathrm{TGV}(z, g) = \alpha_1 \lVert \nabla z - g \rVert_1 + \alpha_0 \lVert \mathcal{E} g \rVert_1 . \tag{7.20} \]


The purpose of the TGV term is that it strengthens the coupling be- 
tween the direct estimation of the depth z (respectively surface s) and the 
estimation of the surface gradients g (respectively surface normals n) by 
minimizing the distance between the gradient field Vz calculated from 
the depth map and the gradient field of the surface g. In addition, g is 
forced by a data term to stay in the proximity of the deflectometrically 
measured gradient $g_m$. At the same time, a deviation of the surface from 
the depths $z_i$ is penalized. The choice of $\alpha_0 > 0$ causes a smoothing of 
the gradient field and reduces the influence of noise, and in addition, 
it implicitly helps to fill holes in the data if gradient information is not 
available at all locations [28]. 
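
The staircasing argument can be checked on a one-dimensional toy profile. The following sketch assumes forward differences and unit grid spacing; the helper names `tv` and `tgv_energy` and the parameter values are hypothetical and not taken from this work. It evaluates both regularizers on a tilted plane: TV charges the ramp exactly as much as a single step of the same height, whereas the TGV term of (7.20) vanishes once the auxiliary gradient g equals the slope.

```python
def diff(x):
    """Forward differences of a 1D sequence (unit grid spacing assumed)."""
    return [b - a for a, b in zip(x[:-1], x[1:])]


def tv(z):
    """Total variation: L1 norm of the first derivative."""
    return sum(abs(d) for d in diff(z))


def tgv_energy(z, g, alpha1=1.0, alpha0=2.0):
    """1D analogue of (7.20): alpha1 * |Dz - g|_1 + alpha0 * |Dg|_1."""
    coupling = sum(abs(dz - gi) for dz, gi in zip(diff(z), g))
    smoothness = sum(abs(d) for d in diff(g))
    return alpha1 * coupling + alpha0 * smoothness


ramp = [0.5 * i for i in range(11)]   # tilted plane, slope 0.5, height 5
step = [0.0] * 5 + [5.0] * 6          # single step of the same height
slope = [0.5] * 10                    # auxiliary gradient field g

tv_ramp = tv(ramp)                    # TV cost of the tilted plane
tv_step = tv(step)                    # identical TV cost for the step
tgv_ramp = tgv_energy(ramp, slope)    # zero: the ramp is not penalized
```

Because TV cannot distinguish the ramp from the step, noise can tip the optimum toward piecewise constant solutions; the second-order term removes that bias for planar regions.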

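Problem (7.19) is convex but non-smooth due to the $L_1$-norm. Therefore, 
as explained in Sec. 2.5, it is necessary to reformulate it as an 
equivalent convex-concave saddle point problem. This formulation is obtained 
by dualizing only the TGV term and considering the depth and normal 
data terms as regularization functions 

\[ G_1(z) = \sum_i w_i \lVert z - z_i \rVert^2 , \qquad G_2(g) = w_m \lVert g - g_m(z) \rVert^2 . \tag{7.21} \]

The convex conjugates of the weighted $L_1$-norms $\alpha_0 \lVert \cdot \rVert_1$ and $\alpha_1 \lVert \cdot \rVert_1$ 
contained in the TGV term are the indicator functions [34] 

\[ \delta_{Y_0}(y_0) = \begin{cases} 0, & \lVert y_0 \rVert_\infty \le \alpha_0 \\ \infty, & \lVert y_0 \rVert_\infty > \alpha_0 \end{cases} , \qquad \delta_{Y_1}(y_1) = \begin{cases} 0, & \lVert y_1 \rVert_\infty \le \alpha_1 \\ \infty, & \lVert y_1 \rVert_\infty > \alpha_1 \end{cases} . \tag{7.22} \]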
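Algorithm 4 
Initialize: $z^{(1)} = \bar z^{(1)}$ from the regularizing depth estimates $z_i$, $g^{(1)} = \bar g^{(1)} = g_m(z^{(1)})$, 
$y_1^{(1)} = \nabla z^{(1)} - g^{(1)}$, $y_0^{(1)} = \mathcal{E} g^{(1)}$ 
for $n = 1, 2, 3, \dots, n_{\max}$ do 
    Proximal gradient ascent in the dual variables 
    $y_1^{(n+1)} \leftarrow \operatorname{prox}_{\delta_{Y_1}} \left( y_1^{(n)} + \tau_{y_1} \left( \nabla \bar z^{(n)} - \bar g^{(n)} \right) \right)$ 
    $y_0^{(n+1)} \leftarrow \operatorname{prox}_{\delta_{Y_0}} \left( y_0^{(n)} + \tau_{y_0} \, \mathcal{E} \bar g^{(n)} \right)$ 
    Update deflectometric surface gradient 
    $g_m \leftarrow g_m(z^{(n)})$ 
    Proximal gradient descent in the primal variables 
    $z^{(n+1)} \leftarrow \operatorname{prox}_{G_1} \left( z^{(n)} - \tau_z \operatorname{div}_\nabla y_1^{(n+1)} \right)$ 
    $g^{(n+1)} \leftarrow \operatorname{prox}_{G_2} \left( g^{(n)} - \tau_g \left( \operatorname{div}_{\mathcal{E}} y_0^{(n+1)} - y_1^{(n+1)} \right) \right)$ 
    Extrapolation 
    $\bar z^{(n+1)} = 2 z^{(n+1)} - z^{(n)}$ 
    $\bar g^{(n+1)} = 2 g^{(n+1)} - g^{(n)}$ 
end for 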


Finally, with the help of the dual variables $y_1$, $y_0$, the discrete saddle 
point problem can be formulated as 

\[ \min_{z,\,g} \max_{y_1,\,y_0} \; \langle \nabla z - g, y_1 \rangle + \langle \mathcal{E} g, y_0 \rangle + G_1(z) + G_2(g) - \delta_{Y_1}(y_1) - \delta_{Y_0}(y_0) . \tag{7.23} \]

The individual variables are scalar, vector, or tensor fields parameterized 
by the spatial pixel grid, $w_i, w_m, z, z_i \in \mathbb{R}^{N \times M}$, $g, g_m, y_1 \in \mathbb{R}^{N \times M \times 2}$, 
$y_0 \in \mathbb{R}^{N \times M \times 2 \times 2}$, or scalar weighting factors $\alpha_0, \alpha_1 \in \mathbb{R}$. 

Using the divergence operators $\operatorname{div}_\nabla$, $\operatorname{div}_{\mathcal{E}}$ that are adjoint to $\nabla$, $\mathcal{E}$ [27], 
the saddle point problem can be solved by iterative 
gradient descent in the primal variables $z$, $g$ and gradient ascent in the 
dual variables $y_1$, $y_0$ [34]. As explained in Sec. 2.5, a corresponding 
primal-dual optimization scheme can be derived. Alg. 4 shows the 
optimization algorithm. 

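Since the deflectometrically measured normals depend on the distance 
to the surface, the measured gradient field $g_m(z)$ is updated in each 
iteration. The proximal operators can be derived by solving the separate 
problem (2.38) and can be stated in closed form [34]: 

\[ \begin{aligned}
\operatorname{prox}_{\delta_{Y_1}}(\tilde y_1) &= \frac{\tilde y_1}{\max\left(1, \frac{\lVert \tilde y_1 \rVert}{\alpha_1}\right)} , &
\operatorname{prox}_{G_1}(\tilde z) &= \frac{\tilde z + 2 \tau_z \sum_i w_i z_i}{1 + 2 \tau_z \sum_i w_i} , \\
\operatorname{prox}_{\delta_{Y_0}}(\tilde y_0) &= \frac{\tilde y_0}{\max\left(1, \frac{\lVert \tilde y_0 \rVert}{\alpha_0}\right)} , &
\operatorname{prox}_{G_2}(\tilde g) &= \frac{\tilde g + 2 \tau_g w_m g_m}{1 + 2 \tau_g w_m} .
\end{aligned} \tag{7.24} \]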
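
To make the interplay of the dual ascent, the primal descent, and the closed-form proximal operators concrete, the following is a minimal one-dimensional sketch of such a primal-dual iteration. It is not the implementation used in this work and simplifies several points, all of which are assumptions: the surface is a 1D profile, gradients live on the edges between samples, the measured gradient `g_m` is held fixed instead of being re-evaluated per iteration, the dual variables start at zero, and a single pair of step sizes `tau`, `sigma` is used. The function names are hypothetical.

```python
def grad(x):
    """Forward difference operator D: R^n -> R^(n-1)."""
    return [x[i + 1] - x[i] for i in range(len(x) - 1)]


def grad_adjoint(y, n):
    """Adjoint D^T: R^(n-1) -> R^n, so that <D x, y> == <x, D^T y>."""
    return [(y[i - 1] if i > 0 else 0.0) - (y[i] if i < n - 1 else 0.0)
            for i in range(n)]


def project(y, alpha):
    """prox of the indicator function: pointwise y / max(1, |y| / alpha)."""
    return [v / max(1.0, abs(v) / alpha) for v in y]


def fuse(z_data, w, g_m, w_m, z0, alpha1=1.0, alpha0=2.0,
         tau=0.25, sigma=0.25, n_iter=200):
    """1D depth and gradient fusion by primal-dual iterations.

    z_data, w : depth samples and per-sample weights (w = 0 marks no data)
    g_m, w_m  : measured gradient on the n-1 edges and its scalar weight
    z0        : initial depth profile
    """
    n = len(z_data)
    z, g = list(z0), list(g_m)
    z_bar, g_bar = list(z), list(g)
    y1 = [0.0] * (n - 1)          # dual variable of D z - g
    y0 = [0.0] * (n - 2)          # dual variable of D g
    for _ in range(n_iter):
        # proximal gradient ascent in the dual variables
        y1 = project([y + sigma * (d - gb)
                      for y, d, gb in zip(y1, grad(z_bar), g_bar)], alpha1)
        y0 = project([y + sigma * d
                      for y, d in zip(y0, grad(g_bar))], alpha0)
        # proximal gradient descent in the primal variables
        dt_y1 = grad_adjoint(y1, n)
        dt_y0 = grad_adjoint(y0, n - 1)
        z_new = [(zi - tau * a + 2 * tau * wi * zd) / (1 + 2 * tau * wi)
                 for zi, a, wi, zd in zip(z, dt_y1, w, z_data)]
        g_new = [(gi - tau * (b - y) + 2 * tau * w_m * gm) / (1 + 2 * tau * w_m)
                 for gi, b, y, gm in zip(g, dt_y0, y1, g_m)]
        # extrapolation
        z_bar = [2 * a - b for a, b in zip(z_new, z)]
        g_bar = [2 * a - b for a, b in zip(g_new, g)]
        z, g = z_new, g_new
    return z, g


# A tilted plane with depth data only at both ends is a stationary point.
z_true = [3.0 + 0.5 * i for i in range(8)]
w = [1.0] + [0.0] * 6 + [1.0]
z_data = [z_true[0]] + [0.0] * 6 + [z_true[-1]]
z_fused, g_fused = fuse(z_data, w, [0.5] * 7, 100.0, z0=z_true, n_iter=50)
```

The step sizes are chosen so that `tau * sigma` times the squared operator norm stays below one; for this 1D operator the squared norm is at most 8, so `tau = sigma = 0.25` is a conservative choice.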

Although the presented reconstruction algorithm is formulated very generally, 
it can be applied directly to the light field camera data. In principle, two 
approaches can be considered: multi-depth reconstruction 
and multi-view reconstruction. 


7.2.2 Multi-Depth Reconstruction 


In the multi-depth approach, different depth maps are used for regu- 
larization and are combined to make the initial depth estimate more 
robust. For light field-based depth estimation, most of the time only the 
depth for the central SAI is available. Therefore, only the three central 
depth maps $z_i$ from Sec. 7.1 are used with $i \in \{\text{direct}, \text{indirect}, \text{multi}\}$. 
As explained in Sec. 2.4, due to the perspective projection occurring in the 
light field camera, a variable substitution must be performed so that the 
surface gradient can be determined from the measured surface normals. 
By transforming the depth maps 

\[ \hat z_i(s,t) := \ln(z_i(s,t)), \tag{7.25} \]

the surface gradient corresponding to this substitute surface can be easily 
calculated from the deflectometrically measured normal 

\[ \hat n_m(\hat z) = n_m(\exp(\hat z)) = (n_1, n_2, n_3)^T \tag{7.26} \]

as a function of the given depth: 

\[ g_m(\hat z(s,t)) = - \left( \frac{n_1}{n_1 (s - c_s) + n_2 (t - c_t) \frac{f_s}{f_t} + n_3 f_s} , \; \frac{n_2}{n_1 (s - c_s) \frac{f_t}{f_s} + n_2 (t - c_t) + n_3 f_t} \right)^T , \tag{7.27} \]

where the normal is obtained from (7.15). In order to model the 
perspective projection, the intrinsic camera parameters $c_s := c_s(u_c)$, $c_t := c_t(v_c)$, 
$f_s$, $f_t$ from Sec. 6.2.3 are required as well. 
Since most depth estimation algorithms provide a confidence measure, 
this can directly be used as weighting factor $w_{\text{direct}}$ and $w_{\text{indirect}}$, and the 
inverse of the normal disparity is used to calculate $w_{\text{multi}}$. Also, because 
the deflectometric normal estimation is several orders of magnitude more accurate 
than the depth estimation, $w_m$ is selected to be about 100 times larger than 
the average of the other weights. After finding a minimum for (7.19), the 
true surface can be derived by back-substitution from (7.25) as $z = \exp(\hat z)$. 
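
The effect of the logarithmic substitution can be verified numerically. The sketch below assumes a plain pinhole model, x = (s - c_s) z / f_s and y = (t - c_t) z / f_t, with made-up intrinsics; the function names are hypothetical. Under this assumption, the gradient of ln(z) depends only on the normal and the pixel position, not on the depth itself, which is what makes the substitution convenient.

```python
import math


def log_depth_gradient(n, s, t, cs, ct, fs, ft):
    """Gradient of ln(z) w.r.t. the pixel coordinates (s, t) for normal n.

    Assumes the pinhole model x = (s - cs) * z / fs, y = (t - ct) * z / ft.
    """
    n1, n2, n3 = n
    d = n1 * (s - cs) / fs + n2 * (t - ct) / ft + n3   # n . (x/z, y/z, 1)
    return (-n1 / (fs * d), -n2 / (ft * d))


def plane_depth(n, c, s, t, cs, ct, fs, ft):
    """Depth z(s, t) of the plane n . X = c seen through the same model."""
    n1, n2, n3 = n
    return c / (n1 * (s - cs) / fs + n2 * (t - ct) / ft + n3)


# Hypothetical intrinsics and a tilted test plane (units: pixels and mm).
cs, ct, fs, ft = 300.0, 200.0, 1000.0, 1100.0
norm = math.sqrt(0.1 ** 2 + 0.2 ** 2 + 1.0)
normal = (0.1 / norm, -0.2 / norm, 1.0 / norm)
s0, t0, h = 350.0, 260.0, 1e-2

analytic = log_depth_gradient(normal, s0, t0, cs, ct, fs, ft)

# Central finite differences of ln(z) on the plane for comparison.
numeric_s = (math.log(plane_depth(normal, 500.0, s0 + h, t0, cs, ct, fs, ft))
             - math.log(plane_depth(normal, 500.0, s0 - h, t0, cs, ct, fs, ft))) / (2 * h)
numeric_t = (math.log(plane_depth(normal, 500.0, s0, t0 + h, cs, ct, fs, ft))
             - math.log(plane_depth(normal, 500.0, s0, t0 - h, cs, ct, fs, ft))) / (2 * h)
```

After the optimization, the depth is recovered pixel-wise with `math.exp` applied to the substitute surface.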


7.2.3 Multi-View Reconstruction 


A disadvantage of the naive multi-depth approach is that only the depth 
estimate for the central SAI is considered, although all other SAIs could 
also contribute to the reconstruction of the surface. Consequently, the 
lateral resolution of the reconstruction is limited by the spatial resolution 
of the central SAI. Furthermore, the depth estimation-based regulariza- 
tion approaches are only applicable to a very limited group of surfaces. 
In contrast, multi-view regularization through normal disparity min- 
imization can be applied to more diverse surface types and provides 
regularization information in each SAI. 

The individual depth maps $z_{uv}(s,t)$ are initially defined on different 
virtual sensor planes; therefore, they have to be transformed into a 
common grid. To use the multi-view information to increase the lateral 
resolution of the reconstructed surface, the individual depth estimates 
are transformed into a new grid, which does not need to be limited by 
the spatial resolution of the central SAI. In this case, the perspective 
projection does not need to be modeled and instead, the grid can be defined 
by an orthographic projection. Consequently, all depth maps $z_{uv}(s,t)$ are 
transformed to point clouds $s_{uv}$ using (6.26) and are then orthographically 
projected onto a new common grid $z(s,t)$, where the grid should be 
designed to enclose all relevant surface points. Alternatively, multi-view 
regularization could be performed directly on a pre-defined orthographic 
grid. That is, instead of minimizing the normal disparity for each camera 
pixel, the disparity for each grid point can be optimized along the depth. 

During the optimization, the surface normals corresponding to each 
depth value are obtained by transforming the depth map back to a point 
cloud, calculating the normal estimate for each SAI using (7.15), and by 
taking the average over all estimates. Because an orthographic projection 
is used, no variable substitution needs to be performed. The surface 
gradient can be calculated from the deflectometrically measured normal 
estimate $n(z) = (n_1, n_2, n_3)^T$ as a function of the given depth: 

\[ g_m(z(s,t)) = - \left( \frac{n_1}{n_3} , \; \frac{n_2}{n_3} \right)^T . \tag{7.28} \]

For the weight factors, the inverse of the normal disparity is used to 
define $w_i$, and the weight of the surface gradients $w_m$ is selected to be 
about 100 times larger, considering that its accuracy is higher as well. 
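
For the orthographic grid, the relation between normal and gradient and the weight definition reduce to a few lines. The following sketch assumes unit grid spacing in world units; the small constant `eps`, which guards against division by zero, and the function names are assumptions and not part of this work.

```python
import math


def orthographic_gradient(n):
    """Surface gradient (dz/ds, dz/dt) from a normal n = (n1, n2, n3)."""
    n1, n2, n3 = n
    return (-n1 / n3, -n2 / n3)


def depth_weight(disparity, eps=1e-12):
    """Weight of a regularization point: inverse of its normal disparity."""
    return 1.0 / (disparity + eps)


# The plane z = 0.3 s - 0.2 t + 5 has a normal proportional to (-0.3, 0.2, 1).
length = math.sqrt(0.3 ** 2 + 0.2 ** 2 + 1.0)
normal = (-0.3 / length, 0.2 / length, 1.0 / length)
gradient = orthographic_gradient(normal)

w_i = depth_weight(4e-8)     # hypothetical disparity value in rad^2
w_m = 100.0 * w_i            # gradient weight about 100 times larger
```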


7.3 Evaluation 


The next sections examine the steps necessary for specular surface recon- 
struction and analyze the presented procedures. The experimental setup 
that was used to conduct the deflectometric measurement is shown in 
figure 5.17. A 27″ monitor with a resolution of 2560 × 1440 px and a pixel 
pitch of 233 µm was used to display the necessary phase-shift patterns. 
For image acquisition, the Lytro Illum light field camera was employed. 

Since the light field camera can be interpreted as a multi-camera array, 
a multi-view approach for specular surface reconstruction is pursued in 
this work. The measurement setup was therefore designed to provide the 
most ideal conditions for this measurement principle. For the multi-view- 
based regularization to find a distinct minimum in the normal disparity, 
the normal field must exhibit substantial variability. According to the find- 
ings of Werling [223], this can be achieved with small camera-to-object 
and monitor-to-object distances, and an angle between camera/monitor 
axis and mean surface normal of about 45°. The monitor and camera are 
therefore tilted 90° to each other, and the specular objects are placed at 
a distance of about 30 to 60 cm in the camera’s field of view. In deflec- 
tometry, the choice of the focal plane influences the reconstruction. If the 
camera focuses on the surface, its lateral resolution is maximized, but 
the monitor is blurred, which increases the uncertainty of the reference 
feature and leads to a less favorable estimation of the surface normal. 
When focusing on the monitor, the slope estimate is ideal, but surface 
features are blurred, which degrades the effective lateral resolution of 
the reconstruction [223]. As a compromise, in the experimental setup of 
this work, the camera is focused on an area slightly behind the surface. 
Nevertheless, since the Lytro Illum is an unfocused plenoptic camera 
with a very large depth of field, the choice of the focal plane is 
relatively insignificant. 

Furthermore, phase-shift coding was used to obtain reference features, 
with M = 12 shifts and frequencies f = (1, 4, 16, 64). The probabilistic 
approach from Ch. 4 was used for phase unwrapping unless specified 
otherwise. The light field camera was calibrated using the methods from 
Ch. 6. Hence, for each deflectometric measurement a light field containing 
the encoded monitor data is retrieved, where the light field resolution 
is set to $(N_u, N_v, N_s, N_t) = (13, 13, 434, 625)$. The extrinsic calibration of 
the measurement system was conducted using the calibrated light field 
camera and the methods from Sec. 5.4. 
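
As a reminder of how a single frequency is decoded, the following sketch shows the classical M-step phase-shift estimator for one pixel. It assumes the textbook intensity model I_k = A + B cos(phi + 2 pi k / M) and returns only the wrapped phase and the modulation; it does not reproduce the probabilistic unwrapping from Ch. 4, and the function name is hypothetical.

```python
import math


def decode_phase(intensities):
    """Wrapped phase and modulation from M equally spaced phase shifts.

    Assumes I_k = A + B * cos(phi + 2 * pi * k / M) with M >= 3.
    """
    m = len(intensities)
    c = sum(i_k * math.cos(2 * math.pi * k / m) for k, i_k in enumerate(intensities))
    s = sum(i_k * math.sin(2 * math.pi * k / m) for k, i_k in enumerate(intensities))
    phase = math.atan2(-s, c) % (2 * math.pi)
    modulation = 2.0 / m * math.hypot(c, s)
    return phase, modulation


# Synthetic pixel with M = 12 shifts, offset 0.5, modulation 0.4, phase 1.234.
M = 12
samples = [0.5 + 0.4 * math.cos(1.234 + 2 * math.pi * k / M) for k in range(M)]
phase, modulation = decode_phase(samples)
```

A low modulation indicates a weak or diffuse reflection, which is one reason why such pixels carry a larger coordinate uncertainty.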

For the analysis of the reconstruction accuracy, different reference 
samples were examined. Because their shape is known, this can be used 
to evaluate the reconstruction accuracy of the presented methods by 
calculating the distance between the reconstructed surface z and the true 
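surface $z_{\mathrm{GT}}$. For this, the true surface is first fitted onto the reconstructed 
data and then the depth values are compared. Two error metrics are used: 
the root-mean-square error (RMSE) and the peak-to-valley (PV) deviation 

\[ \mathrm{RMSE} = \sqrt{\operatorname{mean}\left( |z - z_{\mathrm{GT}}|^2 \right)} , \tag{7.29} \]

\[ \mathrm{PV} = \left| \max (z - z_{\mathrm{GT}}) - \min (z - z_{\mathrm{GT}}) \right| , \tag{7.30} \]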


where both metrics are calculated over all valid surface points. 
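
Both metrics translate directly into code. The sketch below operates on flat lists that are assumed to contain only the valid surface points; the function names and the sample values are illustrative.

```python
import math


def rmse(z, z_gt):
    """Root-mean-square error between reconstruction and reference, cf. (7.29)."""
    diffs = [a - b for a, b in zip(z, z_gt)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))


def peak_to_valley(z, z_gt):
    """Peak-to-valley deviation of the error map, cf. (7.30)."""
    diffs = [a - b for a, b in zip(z, z_gt)]
    return abs(max(diffs) - min(diffs))


z_rec = [1.0, 2.0, 3.0, 4.0]
z_ref = [1.0, 2.5, 2.0, 4.0]
err_rmse = rmse(z_rec, z_ref)          # sqrt((0.25 + 1.0) / 4)
err_pv = peak_to_valley(z_rec, z_ref)  # 1.0 - (-0.5)
```

The RMSE summarizes the average deviation, while PV is determined by the two most extreme points and is therefore much more sensitive to outliers.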


7.3.1 Regularization 


A partially specular surface is necessary for the evaluation of depth 
estimation-based regularization. For this purpose, a disk from a hard 
drive was used as a reference sample, which shows partially reflective 
areas in the form of color markings and scratches. For the presented 
regularization methods, the surface must be coded by structured illumi- 
nation. This allows not only to estimate the monitor coordinates but also 
to obtain the associated coordinate uncertainty. Since the uncertainty 
increases dramatically for non-specular or weakly reflective areas, this 
can be used as an indicator for the relevant surface areas. Therefore, a 
threshold on the uncertainty provides masking of the data. 

Figure 7.4 Partially specular disk: (b) Horizontal monitor coordinate. 
(c) Coordinate uncertainty. (d) Mask. The mask is calculated by thresholding 
the uncertainty estimation. The uncertainty increases near scratches and 
color markings. 
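
The masking step can be sketched as a simple threshold on the per-pixel uncertainty. The threshold value and the toy uncertainty map below are made up for illustration; a suitable value depends on the coding and the measurement setup.

```python
def specular_mask(uncertainty, threshold):
    """True where the coordinate uncertainty is at or below the threshold."""
    return [[u <= threshold for u in row] for row in uncertainty]


def apply_mask(values, mask):
    """Replace masked-out entries by None so that later steps can skip them."""
    return [[v if keep else None for v, keep in zip(v_row, m_row)]
            for v_row, m_row in zip(values, mask)]


# Toy 2 x 3 uncertainty map: the right column mimics the diffuse background.
uncertainty = [[0.02, 0.03, 0.90],
               [0.02, 0.15, 0.85]]
coordinates = [[101.0, 102.0, 400.0],
               [101.5, 102.5, 390.0]]

mask = specular_mask(uncertainty, threshold=0.2)
valid_coordinates = apply_mask(coordinates, mask)
```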

Figure 7.4 shows the disk, the measured vertical monitor coordinates, 
the coordinate uncertainty, and the resulting calculated mask. Even 
non-specular components of the background provide registration data. 
However, these points can easily be removed by the masking. As 
expected, the uncertainty is larger for the diffuse components of the surface 
than for the completely specular ones, but it is still much smaller than in the 
areas outside the disk. Thus, the reconstruction of the specular surface is 
performed only for those pixels that observe the disk. 

Figure 7.5 Depth estimation and corresponding confidence measures. (a) & (d) Direct 
depth estimation. (b) & (e) Indirect depth estimation. (c) & (f) Multi-view regularization. 


7.3.1.1 Depth estimation 


To measure the direct depth, no structured illumination is necessary; 
instead, the monitor was set to display a completely white image for 
adequate brightness. Due to the roughness of the mirror and the color 
markings, a classical structure tensor-based orientation estimator was 
used for the depth estimation [218]. This approach provides the disparity, 
i.e., the slope of the lines, which can be converted into the distance to 
the disk. Further, it also yields an additional confidence measure for 
the estimated depth. The confidence is high in the vicinity of structured 
image areas where the lines in the EPIs are visible. If there is no structure, 
there are no lines in the EPIs, which results in a low confidence. For the 
indirect depth estimation, phase-shift coding was used to assign the 
horizontal and vertical monitor coordinates to each light field pixel. The same 
structure tensor-based algorithm can then be applied to this registration 
data; the only difference is that there are only two “color channels”. 
Since the method does not perform correctly near regions of strong 
curvature, second-order gradients are calculated on the registration data. 
A final confidence measure is then obtained by combining the confidence 
of the depth with the inverse of the calculated curvature. After estimating 
the monitor’s depth, the 
indirect depth can easily be calculated by using (7.5). Figure 7.5 shows 
the estimates of the surface as a point cloud using the different methods 
as well as the corresponding confidence measures, which are used as 
weighting for the subsequent surface reconstruction. For comparison, 
the multi-view regularization is shown as well, where the inverse normal 
disparity is used as a confidence measure. The figure shows that the 
direct depth estimation is very noisy because the surface itself has only 
a few areas with structure. This can also be seen in the corresponding 
confidence map, where only the areas near the color markings and the 
edge of the disk show high confidence. The indirect depth estimation is 
much less noisy since phase-shift coding suppresses image noise. The 
confidence map is also much more consistent. Yet, the confidence de- 
creases in the vicinity of dents on the surface, as the curvature increases 
here. The multi-view depth estimate looks the best. The associated 
confidence values are higher on the fully specular areas than near the color 
markings. Outside the specular disk, the confidence is zero since these 
areas are not examined due to the masking. 

The major disadvantage of direct and indirect depth estimation is 
that it only works for very specific surfaces. If the surface is completely 
specular, no direct surface features can be detected. If the surface has cur- 
vature, the depth of the reflection is compressed or stretched. Figure 7.6 
shows this behavior for the reconstruction of a convex mirror. While the 
multi-view regularization can reconstruct the surface, the indirect depth 
estimation fails completely, even though the surface has only a very small 
curvature of $\kappa = 1/800\ \mathrm{mm}^{-1}$. The position of the surface is in some 
cases even estimated to lie behind the camera. In conclusion, the indirect 
depth estimation may only be used for planar surfaces or needs further 
improvements. For the time being, it should therefore be regarded as a 
theoretical concept and handled with caution in practical use. 

Figure 7.6 Reconstruction of a convex surface: (a) The indirect depth estimation fails 
even for surfaces with only marginal curvature. (b) The multi-view regularization correctly 
estimates the surface depth. 


7.3.1.2 Normal Disparity Minimization 


Minimizing the normal disparity requires neither diffuse surface fea- 
tures nor triangulating the distance to the monitor. Instead, an arbitrarily 
shaped surface can be found by triangulating the normal field. The 
inspection of a planar surface and of a concave surface is shown in figure 7.7. 
The figure shows the reconstructed point cloud and the disparity of a 
camera pixel as a function of the distance to the surface. 

For both surfaces, the disparity increases strongly for decreasing dis- 
tances, so that the lower bound of the optimization problem (7.16) can 
be defined without problems. The disparity of the planar surface shows 
a clear minimum, and it can be seen that the disparity decreases as the 
distance approaches infinity. It has a local maximum at a distance of 
about 60 cm. The upper bound for the optimization can therefore be set 
very loosely since the measurement space of the experimental setup is 
only slightly larger than 60 cm. The true minimum can therefore be found 
easily. For the concave surface, two dominant minima emerge. As already 
explained in Sec. 7.1.2, this is a peculiarity of concave surfaces: 
for stereo deflectometry, there are surfaces where the disparity shows 
equivalent minima at different distances. Fortunately, this is not the case 
for the multi-stereo approach investigated here, and a clear minimum can 
still be seen. However, it is much more difficult to define the upper limit 
of the search space, since the disparity of the investigated pixel shows a 
local maximum at a value of just over 55 cm, which is less than 15 cm 
away from the true minimum. Depending on how strongly the 
surface is inclined, the disparity curve for some pixels is thus shifted 
further to the right or left. In the worst case, the minimization drifts 
into the second minimum for some points. However, since the associated 
disparity is much larger than that of the true minimum, these 
erroneous estimates can still be eliminated in post-processing. 

Figure 7.7 Multi-view regularization: The plots show the reconstructed point cloud and 
the normal disparity J(z) as a function over the distance z. (a) & (c) Planar surface and the 
disparity of a pixel. (b) & (d) Concave surface and the disparity of a pixel. 
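
The bounded one-dimensional search and the subsequent rejection of implausible minima can be sketched as follows. For a self-contained example, a golden-section search is used here in place of Brent's method, and the disparity curve is a made-up unimodal function; the bounds, the rejection threshold, and all function names are assumptions.

```python
import math

INV_PHI = (math.sqrt(5.0) - 1.0) / 2.0   # reciprocal of the golden ratio


def minimize_bounded(f, lo, hi, tol=1e-6):
    """Golden-section search for a minimum of f on the interval [lo, hi]."""
    a, b = lo, hi
    c = b - INV_PHI * (b - a)
    d = a + INV_PHI * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:                       # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - INV_PHI * (b - a)
            fc = f(c)
        else:                             # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + INV_PHI * (b - a)
            fd = f(d)
    z = 0.5 * (a + b)
    return z, f(z)


def reject_outliers(depths, disparities, max_disparity):
    """Discard depth estimates whose minimal disparity is implausibly large."""
    return [z if j <= max_disparity else None
            for z, j in zip(depths, disparities)]


def toy_disparity(z):
    """Hypothetical disparity curve with its minimum at z = 40 (in cm)."""
    return (z - 40.0) ** 2 + 1e-3


z_hat, j_hat = minimize_bounded(toy_disparity, 20.0, 60.0)
kept = reject_outliers([z_hat, 58.0], [j_hat, 5.0], max_disparity=1.0)
```

A search of this kind only finds the minimum inside the chosen bracket, which is why the choice of the upper bound matters for the concave surface, and why a final threshold on the disparity is a cheap safeguard against points that ended up in the second minimum.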
The normal disparity can be interpreted as the variance of the angle 
between the normal estimates. The square root of the disparity thus 
indicates how large the spread of the angles is at the point under 
consideration. For the plane mirror, the minimum of the square root of 
the disparity is $\sqrt{J}$ = 22 µrad and the local maximum is $\sqrt{J}$ = 3 mrad. 
For the concave mirror, the minimum is $\sqrt{J}$ = 150 µrad and the local 
maximum is $\sqrt{J}$ = 1 mrad. These very small values are due to the very small 
baseline between the SAIs. In a standard stereo-deflectometry system 
with the same baseline, the same disparities would be practically 
indistinguishable because they would be masked by noise. This would 
make reconstruction impossible [221]. The light field-based multi-view 
approach with 13 × 13 SAIs can still resolve the small disparity range 
despite the small baseline because the multiple views allow a reliable 
disparity estimation. 
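
The interpretation as an angular spread can be illustrated with a small stand-in measure. The sketch below is not the disparity definition (7.16) of this work, which is not reproduced here; it merely computes the weighted root-mean-square angle between a set of unit normals and their weighted mean direction, which has the same units and a comparable meaning. The function name is hypothetical.

```python
import math


def angular_spread(normals, weights):
    """Weighted RMS angle (in rad) between unit normals and their mean direction."""
    mean = [sum(w * n[i] for n, w in zip(normals, weights)) for i in range(3)]
    length = math.sqrt(sum(m * m for m in mean))
    mean = [m / length for m in mean]
    variance = 0.0
    for n, w in zip(normals, weights):
        cross = (n[1] * mean[2] - n[2] * mean[1],
                 n[2] * mean[0] - n[0] * mean[2],
                 n[0] * mean[1] - n[1] * mean[0])
        dot = sum(a * b for a, b in zip(n, mean))
        # atan2 stays accurate for the microradian angles that occur here
        angle = math.atan2(math.sqrt(sum(c * c for c in cross)), dot)
        variance += w * angle ** 2
    return math.sqrt(variance / sum(weights))


theta = 1e-4   # two views whose normals are tilted by +/- 100 microradians
normals = [(0.0, math.sin(theta), math.cos(theta)),
           (0.0, -math.sin(theta), math.cos(theta))]
spread = angular_spread(normals, [1.0, 1.0])
```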

While minimizing the disparity already yields surface points, the 
resulting surface is still not perfect. This is because the triangulation of the 
normal field, like other triangulation methods, depends on the effective 
stereo baseline and the distance to the surface, where the uncertainty of 
the depth estimate increases quadratically with the depth [72]. Thus, for 
better reconstruction, the normal measurement should be used. 


7.3.2 Multi-Depth Reconstruction 


To demonstrate the basic approach and the advantages of the different 
depth estimations, the analysis will be performed here only for the central 
SAI. Figure 7.8 shows the measurement of the partially specular hard disk 
and the result of the 3D reconstruction. Because hard disks in general 
have high planarity, the deviation from the ideal plane is calculated as a 
quality measure. 

The left side of the figure shows the reconstruction error for which the 
respective regularization result from figure 7.5 was used. The confidence 
of the depth is used to mask invalid pixels. The pure light field depth 
estimate of the diffuse surface is therefore only sparsely available in 
areas of high roughness or near the color markers. All other regions are 
evaluated as invalid by the depth estimation, which is accounted for by 
$w_i = 0$ in the fusion. The reconstruction error of the pure regularization is 
relatively high with an RMSE of 14.80 mm. The indirect depth estimation 
and the multi-view regularization are much denser and less noisy. The 
RMSE values of the reconstruction are smaller at 3.76 mm and 4.02 mm, 
respectively. In areas of weak reflection, pixels are marked as invalid in 
the indirect depth estimation because the confidence is very low and the 
depth estimation yields significantly erroneous values. The multi-view 
estimation also has small confidence values in the same areas, but can 
still provide reasonably correct values. 

Figure 7.8 Reconstruction of a partially specular disk. The plots show the distance between 
the disk and an ideal plane. A logarithmic colormap is used for better visualization. 
Panel labels: RMSE = 4.24 mm; RMSE = 24.06 µm. 

If the regularization is used to provide support points for the normal 
integration from Sec. 7.2.1, then the reconstruction error can be signifi- 
cantly reduced. Because the depth estimation is only available sparsely 
in some places, the intermediate values must be interpolated as initial- 
ization of the surface, with the help of which the surface normals can 
then be calculated. For all further steps in Alg. 4 no interpolation has to 
be done, because it is sufficient to use only the valid pixels as support 
points for the fusion. The right side of figure 7.8 shows the corresponding 
results of the fusion. While the direct depth estimation has a relatively 
high error, the multiple regularization points are sufficient to allow a 
reasonably good reconstruction. Interestingly, the reconstruction with 
multi-view regularization with RMSE = 18.19 µm is better than the one 
with the indirect depth estimation with RMSE = 29.00 µm, although the 
regularization points of the indirect depth estimation have the smallest 
error overall. This can be explained by the fact that although the disk has 
been manufactured precisely and with a high degree of flatness, it may 
have a very slight curvature due to external forces, e.g., resulting from 
adding paint markings or from the deliberate application of scratches 
and dents. Therefore, the indirect regularization yields slightly incorrect 
data, as explained before. 

Since all regularization methods use different information as a basis, 
the regularization points have different uncertainties and can thus jointly 
contribute to the improvement of the reconstruction. For this purpose, 
a weighted average of the individual regularization results is calculated first. 
Further, in the depth and normal fusion (7.19), all depth estimates are 
used jointly, weighted by the respective confidences. Figure 7.8(d) and (h) 
show the respective reconstruction errors. Although it often helps to 
merge different sources of information, in this example the result is 
worse in both cases than when using only the multi-view regularization. One 
cause may be that the confidence measures used do not necessarily 
represent the uncertainty of the regularization and therefore may not 
be used as equivalent weights. Another is that the regularization 
methods may show systematic errors that cannot be assessed using a 
confidence estimation. 
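
The first combination step, the confidence-weighted average, can be sketched as follows. Depth maps are represented as flat lists for brevity, a weight of zero marks an invalid pixel, and the function name and sample values are illustrative.

```python
def fuse_depths(depths, weights):
    """Per-pixel weighted mean of several depth maps.

    depths, weights: one flat list per method, all of equal length.
    Returns None for pixels where every method has zero weight.
    """
    fused = []
    for values, ws in zip(zip(*depths), zip(*weights)):
        total = sum(ws)
        if total > 0:
            fused.append(sum(w * v for w, v in zip(ws, values)) / total)
        else:
            fused.append(None)
    return fused


# Two methods, three pixels (depths in mm, hypothetical confidences).
depths = [[400.0, 410.0, 0.0],
          [402.0, 0.0, 0.0]]
weights = [[1.0, 2.0, 0.0],
           [3.0, 0.0, 0.0]]
fused = fuse_depths(depths, weights)
```

Such an average is only as good as its weights: if the confidences do not reflect the actual uncertainties, or if one method is systematically biased, the combined estimate can be worse than the best single one, as observed above.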


7.3.3 Multi-View Reconstruction 


The light field depth estimation algorithms proposed in the literature 
generally provide only the depth of the central SAI, because the high 
redundancy of the light field is usually not needed after the estimation. 
In principle, the algorithms could be adapted to compute the depths in 
other SAIs, but the depth estimation-based regularization approaches 
had other drawbacks, as noted in the last sections, so they will not be 
considered any further here. The advantage of multi-view regularization 
is that multiple views can be used to increase the lateral resolution of the 
reconstruction. For this purpose, all depth estimates from all SAIs are 
transformed into a uniform grid. 


7.3.3.1 Evaluation of the Reconstruction Accuracy 


Since the hard disk from the previous sections can only be regarded as 
approximately planar, different reference mirrors with known shapes 
are used to quantify the accuracy of the reconstruction in the following. 

For the first experiment, a precision surface mirror with $\lambda/20$ flatness is 
used as the surface under test. With the reference wavelength of 632.8 nm, 
the mirror has a maximum peak-to-valley deviation from the perfect 
plane of 31.64 nm. Thus, compared to the achievable accuracy of the 
measurement system in this work, it can be considered perfectly flat. 
Therefore, a perfect plane is fitted to the reconstructed 
point cloud, and for each point, the distance to this plane is evaluated as 
a quality measure. Figure 7.9 shows the results of the surface 
reconstruction. The point cloud, which can be obtained with the help of multi-view 
regularization, already provides a reasonably good reconstruction. Over- 
all, however, the reconstructed surface still appears slightly noisy. The 
corresponding error map indicates that the surface is not yet smooth. 
After optimization by fusion with the estimated surface normals, the result 
is better. The RMSE decreases to 0.99 µm and the PV metric yields 7.94 µm. 
Thus, the reconstruction result shows comparable accuracy to other 
deflectometric measurement systems from the literature [103, 154, 230]. 

Figure 7.9 Reconstruction of a planar mirror: (a) & (c) The accuracy of the regularization 
is quantified with RMSE = 89.71 µm, PV = 478.24 µm. (b) & (d) The accuracy of the 
reconstruction is quantified with RMSE = 0.99 µm, PV = 7.94 µm. 

In a second experiment, a convex surface is to be reconstructed. The 
reference mirror has a radius of curvature of R = 1/« = 800 mm and pla- 
narity of A/2, which can still be considered a nearly perfect reference for 
the measurement accuracy of the deflectometry system used in this work. 
Since the shape of the mirror is known, the distance to the ideal surface 
is again used as a quality measure. Figure 7.10 shows the results of the 
surface reconstruction. The point cloud of the regularization appears 
very noisy and there are strong errors at the edge of the surface. This can 
also be seen in the corresponding error map. The overall error is quite 
high with RMSE = 333.85 pm and PV = 1.50 mm. Looking at the surface 
in detail, a systematic wave-like structure can be seen on the surface. An 
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Figure 7.10 Reconstruction of a convex mirror: (a) & (c) The regularization results in 
RMSE = 333.85 pm, PV = 1.50 mm. (b) & (d) The reconstruction results in RMSE = 
12.02 pm, PV = 41.03 pm. 


explanation for this effect could be vibrations during the measurement 
or a slightly faulty calibration. However, an exact cause is not known. 
Still, the reconstruction of the surface using the depth and normal fusion 
shows reasonable good results and the ripples in the surface disappear 
as well. The shape of the surface is clearly recognizable and the accuracy 
increases strongly to RMSE = 12.02 ym and PV = 41.03 ym. However, 
the reconstruction accuracy is not as good as for the planar surface, which 
is probably due to the inferior result of the regularization. 
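The quality measure used for the planar mirror can be illustrated with a short sketch. It assumes a total-least-squares plane fit via the singular value decomposition and takes the PV value as the difference between the largest and smallest signed distance; neither choice is specified in the text, and the function name `plane_fit_errors` is purely illustrative.

```python
import numpy as np


def plane_fit_errors(points):
    """Fit a total-least-squares plane to an (N, 3) point cloud.

    Returns the signed point-to-plane distances, their RMSE, and the
    peak-to-valley (PV) value, here taken as max minus min distance
    (an assumed definition).
    """
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)
    # The plane normal is the right singular vector that belongs to the
    # smallest singular value of the centered point cloud.
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    normal = vt[-1]
    distances = (points - centroid) @ normal
    rmse = float(np.sqrt(np.mean(distances ** 2)))
    pv = float(distances.max() - distances.min())
    return distances, rmse, pv


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    xy = rng.uniform(-10.0, 10.0, size=(1000, 2))   # lateral coordinates in mm
    z = 0.02 * xy[:, 0] - 0.01 * xy[:, 1] + 5.0     # tilted ideal plane
    z += rng.normal(scale=1e-3, size=len(z))        # synthetic noise of 1 µm
    _, rmse, pv = plane_fit_errors(np.column_stack([xy, z]))
    print(f"RMSE = {rmse * 1e3:.2f} µm, PV = {pv * 1e3:.2f} µm")
```

Fitting the plane before evaluating the distances removes the unknown pose of the mirror, so that only the shape deviation remains in the error map.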

As a last experiment, a concave surface is reconstructed. The reference
mirror has a radius of curvature of R = 406 mm and a planarity of λ/4.
Figure 7.11 shows the results of the surface reconstruction. The surface
can already be recognized in the point cloud of the regularization. As
before, a wave-like structure appears on the surface. The error of the
regularization is relatively high with RMSE = 1.34 mm and PV = 4.00 mm,
which can also be attributed to the peculiarities of concave surfaces. As
explained in Sec. 7.3.1.2, the disparity minimization of concave surfaces
is more susceptible to noise. Still, the final result of the depth and normal
fusion is strongly improved, with RMSE = 54.75 µm and PV = 210.50 µm.

Figure 7.11 Reconstruction of a concave mirror: (a) & (c) The regularization results in
RMSE = 1.34 mm, PV = 4.00 mm. (b) & (d) The reconstruction results in
RMSE = 54.75 µm, PV = 210.50 µm.
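For the curved reference mirrors, the deviation from the ideal surface can be sketched in a similar way. The example below is only an assumption about how such an evaluation might be set up: the sphere center is estimated with an algebraic least-squares fit, and the radial deviation is then evaluated with the known radius of curvature. The function name `sphere_deviation` is hypothetical.

```python
import numpy as np


def sphere_deviation(points, radius):
    """Radial deviation of (N, 3) points from a sphere of known radius.

    The sphere center is estimated with an algebraic least-squares fit
    (an assumption of this sketch); the known radius is used only for
    evaluating the deviation.
    """
    p = np.asarray(points, dtype=float)
    # |p|^2 = 2 c.p + (r^2 - |c|^2) is linear in the center c and in the
    # constant term, so both follow from a linear least-squares problem.
    design = np.column_stack([2.0 * p, np.ones(len(p))])
    rhs = np.sum(p ** 2, axis=1)
    solution, *_ = np.linalg.lstsq(design, rhs, rcond=None)
    center = solution[:3]
    deviation = np.linalg.norm(p - center, axis=1) - radius
    rmse = float(np.sqrt(np.mean(deviation ** 2)))
    pv = float(deviation.max() - deviation.min())
    return deviation, rmse, pv


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    radius = 800.0                                   # mm, convex reference mirror
    center = np.array([0.0, 0.0, -radius])
    xy = rng.uniform(-20.0, 20.0, size=(500, 2))
    z = center[2] + np.sqrt(radius ** 2 - np.sum(xy ** 2, axis=1))
    z += rng.normal(scale=1e-3, size=len(z))         # synthetic noise of 1 µm
    _, rmse, pv = sphere_deviation(np.column_stack([xy, z]), radius)
    print(f"RMSE = {rmse * 1e3:.2f} µm, PV = {pv * 1e3:.2f} µm")
```

For a shallow spherical cap the algebraic fit is sensitive to noise, so a geometric fit with fixed radius may be preferable in practice.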


7.3.3.2 Lateral Resolution 


An advantage of the multi-view regularization is that the lateral resolution
of the surface reconstruction is not limited by the spatial resolution
of the central SAI. The resolution can be specified by the user. The light
field used here has the dimension 13 × 13 × 434 × 625, i.e., 13 × 13 SAIs
with 434 × 625 pixels each. Assuming that each SAI increases the resolution,
the maximum possible resolution of an orthographic grid is therefore
approximately 13 times the resolution of a single SAI. To show the advantage
of the higher resolution, the partially specular disk from the previous
sections is examined in the following. The reconstruction was performed
with the resolutions 400 × 400, 1500 × 1500, and 4000 × 4000, where the
grid is defined to enclose the valid surface points as accurately as
possible. The smallest resolution corresponds approximately to the
resolution that would be obtained if only the central SAI were considered
in the reconstruction.
Figure 7.12 shows the results of the reconstruction of the disk, as
well as close-up views that have been reconstructed with the different
resolutions. The disk shows local defects in the form of scratches and
dents. With the low resolution, the defects in the disk can hardly be
identified. This shows that it is not sufficient to use only the central SAI
for the reconstruction. With a resolution of 1500 × 1500, the defects in
the disk can be recognized very well. If the resolution is increased even
further, there is hardly any noticeable improvement. This is probably
related to the fact that the surface normal n for the reconstruction is
calculated as the weighted average of the individual normal estimates from
all SAIs. A more selective choice of the best normal estimate from all SAIs
or a more sophisticated weighting might therefore improve the results.
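The two strategies mentioned above, weighted averaging versus selecting a single best estimate, can be contrasted in a minimal sketch. The weights are hypothetical confidence values per SAI; how they are actually obtained is not specified here, and both function names are illustrative.

```python
import numpy as np


def fuse_normals(normals, weights):
    """Weighted average of K unit normals (K, 3), renormalized to unit length."""
    normals = np.asarray(normals, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mean = weights @ normals
    length = np.linalg.norm(mean)
    if length == 0.0:
        raise ValueError("weighted normals cancel out")
    return mean / length


def select_best_normal(normals, weights):
    """Alternative: keep only the normal estimate with the largest weight."""
    normals = np.asarray(normals, dtype=float)
    return normals[int(np.argmax(weights))]


if __name__ == "__main__":
    estimates = np.array([[0.00, 0.00, 1.00],
                          [0.02, 0.00, 1.00],
                          [0.00, 0.30, 0.95]])
    estimates /= np.linalg.norm(estimates, axis=1, keepdims=True)
    confidence = np.array([0.5, 0.4, 0.1])   # hypothetical per-SAI weights
    print(fuse_normals(estimates, confidence))
    print(select_best_normal(estimates, confidence))
```

Averaging suppresses noise but also smooths fine surface detail, which is consistent with the observation that the resolution gain saturates.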


7.3.4 Influence of the Calibration 


A substantial part of this thesis was dedicated to the accurate calibration 
of the deflectometric measuring system. That the effort was worthwhile 
will be shown in the following. 

For the evaluation, the surface reconstruction was carried out based 
on four different configurations of the system calibration. The light field 
camera was calibrated using the procedure of Bok et al. [20] and using the 
generic light field reconstruction procedure presented in Ch. 6. In addi- 
tion, for each camera calibration, the influence of the monitor model from 
Sec. 5.3 was analyzed. To assess the reconstruction quality, the planar ref- 
erence mirror was again used and the deviation from the ideal plane was 
evaluated. Figure 7.13 shows the results of the respective reconstructions. 

The results impressively demonstrate that the camera model has a
significant impact on the reconstruction accuracy. With the calibration
method by Bok et al. [20], the surface can still be reconstructed with high
accuracy, but if the proposed generic LF-reconstruction is used, the
results are significantly better. For the method of Bok et al., the
surface appears to show a slight curvature. This is most likely caused by
the comparatively inferior geometric calibration. As was pointed out in
Sec. 6.3, the quality of the SAIs at the edge of the angular plane is worse
than in the center. Accordingly, the calibration error is higher, which
corrupts the deflectometric triangulation of the normal field and, in
addition, leads to an erroneous normal measurement. The precise calibration
of the light field camera presented in this work is therefore indispensable
for the deflectometric reconstruction of specular surfaces.

Figure 7.12 Reconstruction of the partially specular disk with different resolutions of the
grid. Local defects can be identified by increasing the lateral resolution. The
top row shows the content of the red rectangle with an area of approximately 5 mm × 4 mm.
The bottom row shows the content of the blue rectangle with an area of approximately
3 mm × 2.5 mm.

Figure 7.13 Influence of the calibration on the reconstruction accuracy. The plots show
the distance to an ideal plane. Note the difference in scale. (a) RMSE = 2.30 µm,
PV = 8.71 µm. (b) RMSE = 50.19 µm, PV = 219.19 µm. (c) RMSE = 0.99 µm,
PV = 7.94 µm. (d) RMSE = 49.77 µm, PV = 216.92 µm.

The monitor model also affects the result, although not as much as the 
camera calibration. For both camera calibrations, the accuracy is slightly 
better with the monitor model. The RMSE is minimally better and the PV 
decreases a few micrometers in both cases. Comparing the results of the 
generic LF-reconstruction, it is noticeable that without using the monitor 
model, the surface shows a slight curvature. The falsely assumed planar 
monitor display is transferred into a falsified surface reconstruction. By 
using a monitor model, this systematic error can be corrected. 

7.4 Summary 


This chapter described how the special optical properties of a light field 
camera can be used for the deflectometric reconstruction of specular sur- 
faces. The information contained in the light field opens up the possibility 
of regularizing the ambiguity of the deflectometric normal estimation. 
It was explained how classical light field depth estimation algorithms 
can be used to extract regularizing information and how the light field 
camera can be interpreted as a highly multi-view camera array to en- 
able a multi-view regularization by triangulating the normal field. To 
further increase the reconstruction accuracy, the normal measurements 
were fused with the surface obtained from the regularization using a 
variational optimization approach, which was solved with a primal-dual 
optimization algorithm. 

Experiments showed that, despite regularization techniques of different
quality, good results can be achieved. The depth estimate-based
regularization methods are only applicable for a very special group of 
surfaces, whereas the multi-view regularization can be used for arbitrary 
specular free-form surfaces and provides better results. Further inves- 
tigations showed that by fusion of the depth and normal estimates the 
reconstruction accuracy can be drastically improved so that accuracies 
in the lower micrometer range become possible, which is comparable 
to other deflectometry systems from the literature. Moreover, the cali- 
bration of the measurement system had a significant influence on the 
accuracy of the reconstruction. Hence, the calibration methods that were 
presented in this thesis are very well suited for deflectometry. 

In summary, light field-based deflectometry can be realized efficiently 
and in a compact design. Although the stereo baseline between the SAIs is
very small, their immense number allows a high measurement accuracy to be
achieved. This enables the reconstruction of the global
surface form as well as local defects. 
8 Conclusion 


This thesis investigated how light field imaging can be efficiently utilized 
for deflectometry. While the key statements of the individual research 
topics have already been summarized at the end of the respective chap- 
ters, the results achieved with a view on the context of the entire thesis 
are summarized here. 


8.1 Summary 


Deflectometry requires structured illumination, where the encoding of 
the monitor pixel intensities enables the registration of camera pixels to 
monitor pixels. In this thesis, multi-frequency phase-shifting techniques 
were used as they provide high measurement accuracies and allow sub- 
pixel accurate registration. At the same time, however, they introduce 
ambiguities that can only be resolved using phase unwrapping methods. 
Furthermore, several classical methods for phase unwrapping have been 
studied. As a major contribution, a new probabilistic approach for phase 
unwrapping was proposed. Using circular statistics, both the periodic- 
ity of the phase is taken into account and the estimation of the phase 
uncertainty can be included in the unwrapping process, thus automati- 
cally compensating for individual erroneous phase measurements. By 
performing a maximum-likelihood optimization on the probability dis- 
tribution of the phase measurement, the optimal monitor coordinate can 
be decoded for each camera pixel. Moreover, it was shown that by mod- 
eling the local pixel neighborhood, the robustness of the method can be 
improved further, leading to a probabilistic approach for spatio-temporal 
phase unwrapping. Overall, the results showed that the proposed meth- 
ods are significantly more robust to noise influences than state-of-the-art 
methods, resulting in ideal starting conditions for use in deflectometry. 
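To make the registration principle concrete, the following sketch decodes a wrapped phase from N phase-shifted patterns and resolves the ambiguity with a classical two-frequency temporal unwrapping. This is the textbook baseline, not the probabilistic method proposed in this thesis; the intensity model I_n = a + b·cos(φ + 2πn/N) and both function names are assumptions of the example.

```python
import numpy as np


def decode_phase(images):
    """Wrapped phase in [0, 2*pi) from N >= 3 equally shifted patterns.

    Assumes the intensity model I_n = a + b * cos(phi + 2*pi*n/N).
    """
    images = np.asarray(images, dtype=float)
    n_shifts = images.shape[0]
    shifts = 2.0 * np.pi * np.arange(n_shifts) / n_shifts
    shape = (n_shifts,) + (1,) * (images.ndim - 1)
    sin_sum = np.sum(images * np.sin(shifts).reshape(shape), axis=0)
    cos_sum = np.sum(images * np.cos(shifts).reshape(shape), axis=0)
    # sin_sum = -(N/2) b sin(phi) and cos_sum = (N/2) b cos(phi).
    return np.mod(np.arctan2(-sin_sum, cos_sum), 2.0 * np.pi)


def unwrap_two_frequency(phase_coarse, phase_fine, periods):
    """Classical temporal unwrapping with a single-period coarse phase.

    phase_coarse covers the monitor with one period, phase_fine with
    `periods` periods; the fringe order follows by rounding.
    """
    order = np.round((periods * phase_coarse - phase_fine) / (2.0 * np.pi))
    return phase_fine + 2.0 * np.pi * order


if __name__ == "__main__":
    x = np.linspace(0.01, 0.99, 8)          # normalized monitor coordinate
    periods, n_shifts = 8, 4

    def patterns(freq):
        return np.stack([0.5 + 0.4 * np.cos(2 * np.pi * freq * x
                                            + 2 * np.pi * n / n_shifts)
                         for n in range(n_shifts)])

    coarse = decode_phase(patterns(1))
    fine = decode_phase(patterns(periods))
    decoded = unwrap_two_frequency(coarse, fine, periods) / (2 * np.pi * periods)
    print(np.max(np.abs(decoded - x)))
```

The rounding step fails as soon as the noise of the coarse phase exceeds half a fringe order, which is the kind of error that an uncertainty-aware unwrapping aims to compensate.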

Highly accurate calibration is an important prerequisite for precise 
deflectometric measurements. In this thesis, a generic camera model was 
used to calibrate the light field camera, in which the view rays associ- 
ated with each pixel are estimated individually, resulting in a highly 
accurate calibration. To estimate the camera parameters, it was proposed 
to split the calibration into two subproblems, a ray calibration and a 
pose estimation, and it was shown how an alternating minimization 
approach can be used to deal with the tremendous number of param- 
eters. Calibration features were obtained using phase-shift coding and 
the estimated coordinate uncertainty was used as weighting in the opti- 
mization. An analytical solution was given for the ray calibration, and 
the pose was optimized using a gradient descent-based method on the 
rotation manifold. Since the reference monitor used for calibration is not 
ideal, the shape and the refraction at the cover glass were modeled, and 
it was shown how the estimation of the respective parameters could be 
efficiently integrated into the generic calibration framework. Finally, ex- 
periments demonstrated the superiority of the presented generic method 
over classical calibrations and other generic approaches. 

While the generic calibration is very precise, it provides an uncon- 
strained bundle of camera rays. The relationships among these rays are 
lost. Thus, with the generic camera model, it is extremely difficult to 
identify to which pixel a 3D point is projected or which ray is closest to 
that point. For deflectometry, this forward and backward projection is 
a necessity for a correct surface triangulation. In the case of the generi- 
cally calibrated light field camera, this means that the 4D information 
contained in the light field and, in particular, the relations between the 
individual camera rays must be recovered. To achieve this, this thesis 
proposed to use the generic camera calibration as a basis to perform a 
generic light field reconstruction. The approach reconstructs the light 
field from the camera raw data by only considering the geometry of 
the camera rays and by resampling the corresponding intensity values. 
Experiments validated the approach by reconstructing light fields from 
different light field cameras. A comparison with state-of-the-art light 
field reconstruction methods showed that the presented method is better 
able to compensate for lens aberrations since these are already optimally 
contained in the generic bundle of rays. The method was therefore able 
to reconstruct the information of the observed scene as well as to return 
the geometric structure of the light field with the help of an adequate 
rectification and calibration. This can be done regardless of whether the 
light field camera is based on microlenses, mirrors, or coded apertures, 
or whether it is implemented by using a camera array. 

With the help of the registration and calibration, a deflectometric mea- 
surement could finally be carried out. Since the deflectometric normal 
measurement is inherently ambiguous, different regularization methods 
were proposed, which take advantage of the special properties of the 
light field camera. As the most important aspect, a multi-view approach 
was adapted which interprets the light field camera as a highly multi- 
plexed camera array, where possible surface normals can be calculated in 
the field of view of each of these cameras. The normals differ in general, 
yet must coincide on the true surface. By comparing the normal fields, an 
initial estimate of the surface can be found. Moreover, an approach was 
presented to fuse the regularization points with the deflectometrically 
measured surface normals to further increase the accuracy of the surface 
reconstruction. The fusion was formulated as a variational optimization 
problem and a solution was found using a primal-dual algorithm. Exper- 
iments showed that with regularization alone, the mirror surfaces can be 
reconstructed with accuracies in the upper micrometer range. By fusion of 
the depth and normal estimates, the result could be drastically improved 
again: for the planar reference mirror, the reconstructed surface deviated
from the reference shape with an RMSE of about 1 µm and a peak-to-valley
value of less than 10 µm. The investigated light field-based deflectometry approach
thus comes within similar orders of magnitude as comparable methods 
from the literature. Furthermore, by evaluating the influence of the sys- 
tem calibration, it became clear that the proposed generic light field recon- 
struction provides significantly higher surface reconstruction accuracy as 
compared to when using state-of-the-art light field calibration methods. 
This showed that the precise calibration presented in this thesis is an im- 
perative necessity for deflectometric reconstruction of specular surfaces. 

In conclusion, light field-based deflectometry can be efficiently imple- 
mented and enables high-precision reconstruction of specular surfaces. 

8.2 Outlook 


The following is a presentation of ideas and concepts that have emerged 
in the context of this thesis and that present future research opportunities. 
The generic camera model presented in this thesis represents only the
simple geometric properties of the camera; moreover, it assumes that
each pixel can be perfectly described by a single ray. In reality, however,
the camera optics widen the light bundle of each pixel, so that not every
distance is in focus. A cone would therefore be a more accurate
description, where its expansion and shape change as a
function of distance. An estimation of the cone parameters would make
the generic camera model more complete. When a ray hits the monitor, 
an intersection plane is created between the corresponding cone and the 
monitor plane. The area observed by the corresponding pixel is elliptically 
distorted to different degrees depending on the tilt of the monitor. The 
uncertainty of the horizontal and vertical monitor coordinates, which 
can be estimated by phase-shift coding, corresponds to the axes of this 
ellipse. The observation of different intersection planes could open up 
the possibility of determining the distance-dependent focus parameters 
for each ray in addition to the geometric parameters. With multi-focus 
light field cameras, the proposed generic light field reconstruction is still 
subject to limitations, since here sharp rays and blurred rays are processed 
together. The extension of the generic camera model by focus parameters 
should be helpful for the generic light field reconstruction as well. 
While the generic light field reconstruction yields good results for 
classical light field cameras, it would be interesting to apply it in the 
context of spectrally coded light field cameras as proposed by Schambach 
[178]. These cameras encode the spatial dimension of the light field using 
a spectral mask such that there is only a single spectral channel for each 
pixel. The geometric calibration of these cameras is difficult since adjacent 
pixels contain very different information. The generic approach could 
circumvent these difficulties by reconstructing an individual light field for 
each spectral channel, which could then be merged into a single calibrated 
and rectified light field. Whether the SAIs remain spectrally encoded or 
whether one directly reconstructs the complete light field for each spectral 
channel, e.g., by using the generic superresolution approach described in 
this thesis, depends on the requirements of the following applications. 

The reconstruction of specular surfaces still has potential for improve- 
ment. The lateral resolution of the measurement is not limited by the 
spatial resolution of the light field, as has been shown, but can be in- 
creased by considering the angular dimension. However, this could only 
increase the lateral resolution to a certain extent, since the surface nor- 
mals were always calculated as the average of the estimates from all 
SAIs. Thus, a more sophisticated calculation of the normals could improve
the resolution. Alternatively, the depth and normal fusion could
be combined with variational superresolution approaches [206]. 

The light field camera is equivalent to a multiple camera array where 
the baseline between the cameras is in general not much larger than 
1 mm, depending on the model. A perspective change in the light field 
can therefore be considered as a quasi-continuous movement of a single 
camera. This allows estimating specular flow which occurs when the 
reflection of a structured environment is observed on a specular surface. 
While this thesis focused on a high accuracy reconstruction of specular 
surfaces, specular flow can also be used for defect detection tasks or 3D 
measurements with lower accuracy requirements [2, 144]. In the context 
of this thesis, research was conducted on using a light field camera to in- 
duce specular flow and use CNNs to reconstruct the surface, where only 
a structured but unknown environment was required. This thesis does 
not cover this approach, since it only works with synthetic data to some 
extent, but cannot be used for real cameras without further modifications. 
Unlike CNN-based disparity estimation commonly used in the literature, 
surface reconstruction and depth estimation are strongly coupled to the 
respective camera parameters. Training the CNN on synthetic data and 
applying it to real data is therefore not possible for the time being. Other 
applications have already shown that it is possible to consider the camera 
parameters during the design of the CNN’s architecture [52]. Adopting 
this approach for the reconstruction of specular surfaces could lead to 
interesting results. The advantage of specular flow is that it does not re- 
quire a temporal encoding of the illumination but only needs a structured 
reference scene. With the light field camera, a single exposure already 
contains all the required information for specular flow calculation. This 
would open up the possibility for deflectometry in motion. 
9 Appendix 


9.1 Calibration 


9.1.1 Variables 
Matrices of pose subproblem 


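In Sec. 5.2.5, for every single pose with index $k$, an optimization problem
with objective function

\[
f(\mathbf{R}_k, \mathbf{t}_k) = \sum_i w_{ik} \left\| (\mathbf{R}_k \mathbf{x}_{ik} + \mathbf{t}_k) \times \mathbf{d}_i - \mathbf{m}_i \right\|^2 \tag{9.1}
\]

is obtained. This can be written in a more compact form by using the
Kronecker identity (2.3), the cross product operator (2.4), the vec-operator
with $\mathbf{r}_k = \operatorname{vec}(\mathbf{R}_k)$, and the introduction of some new variables:

\begin{align}
\mathbf{A}_{rr,k} &= \sum_i w_{ik} \left(\mathbf{x}_{ik} \mathbf{x}_{ik}^\top\right) \otimes \left([\mathbf{d}_i]_\times^\top [\mathbf{d}_i]_\times\right), \tag{9.2} \\
\mathbf{A}_{tt,k} &= \sum_i w_{ik} [\mathbf{d}_i]_\times^\top [\mathbf{d}_i]_\times, \tag{9.3} \\
\mathbf{A}_{tr,k} &= \sum_i 2 w_{ik} [\mathbf{d}_i]_\times^\top \left(\mathbf{x}_{ik}^\top \otimes [\mathbf{d}_i]_\times\right), \tag{9.4} \\
\mathbf{b}_{r,k} &= \sum_i 2 w_{ik} \left(\mathbf{x}_{ik}^\top \otimes [\mathbf{d}_i]_\times\right)^\top \mathbf{m}_i, \tag{9.5} \\
\mathbf{b}_{t,k} &= \sum_i 2 w_{ik} [\mathbf{d}_i]_\times^\top \mathbf{m}_i, \tag{9.6} \\
h_k &= \sum_i w_{ik} \left\| \mathbf{m}_i \right\|^2, \tag{9.7}
\end{align}

which results in the more compact form:

\[
f(\mathbf{r}_k, \mathbf{t}_k) = \mathbf{r}_k^\top \mathbf{A}_{rr,k} \mathbf{r}_k + \mathbf{t}_k^\top \mathbf{A}_{tt,k} \mathbf{t}_k + \mathbf{t}_k^\top \mathbf{A}_{tr,k} \mathbf{r}_k + \mathbf{b}_{r,k}^\top \mathbf{r}_k + \mathbf{b}_{t,k}^\top \mathbf{t}_k + h_k. \tag{9.8}
\]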
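The equivalence between the direct objective and its compact quadratic form can be checked numerically. The sketch below assumes the common convention that the cross product operator satisfies [d]× v = d × v and a column-major vec-operator; all function names are illustrative.

```python
import numpy as np


def skew(v):
    """Cross product operator with skew(v) @ u == np.cross(v, u)."""
    x, y, z = v
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])


def pose_matrices(X, D, M, w):
    """Accumulate the matrices of the pose subproblem for one pose.

    X: (N, 3) pattern points, D: (N, 3) unit ray directions,
    M: (N, 3) ray moments, w: (N,) weights.
    """
    A_rr = np.zeros((9, 9))
    A_tt = np.zeros((3, 3))
    A_tr = np.zeros((3, 9))
    b_r = np.zeros(9)
    b_t = np.zeros(3)
    h = 0.0
    for x, d, m, wi in zip(X, D, M, w):
        S = skew(d)
        K = np.kron(x[None, :], S)   # (x^T kron [d]x) maps vec(R) to d x (R x)
        A_rr += wi * K.T @ K         # equals (x x^T) kron ([d]x^T [d]x)
        A_tt += wi * S.T @ S
        A_tr += 2.0 * wi * S.T @ K
        b_r += 2.0 * wi * K.T @ m
        b_t += 2.0 * wi * S.T @ m
        h += wi * float(m @ m)
    return A_rr, A_tt, A_tr, b_r, b_t, h


def objective_direct(R, t, X, D, M, w):
    """Weighted sum of squared residuals (R x + t) x d - m."""
    residual = np.cross(X @ R.T + t, D) - M
    return float(np.sum(w * np.sum(residual ** 2, axis=1)))


def objective_compact(R, t, matrices):
    """Quadratic form in r = vec(R) (column-major) and t."""
    A_rr, A_tt, A_tr, b_r, b_t, h = matrices
    r = R.flatten(order="F")
    return float(r @ A_rr @ r + t @ A_tt @ t + t @ A_tr @ r
                 + b_r @ r + b_t @ t + h)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n = 20
    X = rng.normal(size=(n, 3))
    D = rng.normal(size=(n, 3))
    D /= np.linalg.norm(D, axis=1, keepdims=True)
    M = rng.normal(size=(n, 3))
    w = rng.uniform(0.5, 2.0, size=n)
    R, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    t = rng.normal(size=3)
    print(objective_direct(R, t, X, D, M, w),
          objective_compact(R, t, pose_matrices(X, D, M, w)))
```

Since the matrices depend only on the rays and the observed points, they can be accumulated once per pose and reused in every iteration of the rotation and translation updates.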

Matrices of rotation subproblem 


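In Sec. 5.2.5, an optimization problem for the rotation estimation with
objective function

\[
f(\mathbf{R}) = \mathbf{r}^\top \mathbf{A} \mathbf{r} + \mathbf{b}^\top \mathbf{r} + c \tag{9.9}
\]

is obtained. The corresponding parameters can be easily derived by
inserting (5.24) in (9.8):

\begin{align}
\mathbf{A} &= \mathbf{A}_{rr,k} - \tfrac{1}{4} \mathbf{A}_{tr,k}^\top \mathbf{A}_{tt,k}^{-1} \mathbf{A}_{tr,k}, \tag{9.10} \\
\mathbf{b} &= \mathbf{b}_{r,k} - \tfrac{1}{2} \mathbf{A}_{tr,k}^\top \mathbf{A}_{tt,k}^{-1} \mathbf{b}_{t,k}, \tag{9.11} \\
c &= h_k - \tfrac{1}{4} \mathbf{b}_{t,k}^\top \mathbf{A}_{tt,k}^{-1} \mathbf{b}_{t,k}. \tag{9.12}
\end{align}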
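The intermediate step can be written out as follows, under the assumption that (5.24) is the stationarity condition of the compact pose objective with respect to the translation and that $\mathbf{A}_{tt,k}$ is symmetric and invertible. The symbol $\mathbf{g}$ is introduced only for this sketch.

```latex
\begin{align*}
\frac{\partial f}{\partial \mathbf{t}_k}
  &= 2 \mathbf{A}_{tt,k} \mathbf{t}_k + \mathbf{A}_{tr,k} \mathbf{r}_k + \mathbf{b}_{t,k} = \mathbf{0}
  \quad\Longrightarrow\quad
  \mathbf{t}_k^{*} = -\tfrac{1}{2} \mathbf{A}_{tt,k}^{-1} \mathbf{g},
  \qquad \mathbf{g} = \mathbf{A}_{tr,k} \mathbf{r}_k + \mathbf{b}_{t,k}, \\
\mathbf{t}_k^{*\top} \mathbf{A}_{tt,k} \mathbf{t}_k^{*} + \mathbf{t}_k^{*\top} \mathbf{g}
  &= \tfrac{1}{4} \mathbf{g}^\top \mathbf{A}_{tt,k}^{-1} \mathbf{g}
   - \tfrac{1}{2} \mathbf{g}^\top \mathbf{A}_{tt,k}^{-1} \mathbf{g}
   = -\tfrac{1}{4} \mathbf{g}^\top \mathbf{A}_{tt,k}^{-1} \mathbf{g}, \\
f(\mathbf{r}_k, \mathbf{t}_k^{*})
  &= \mathbf{r}_k^\top \mathbf{A}_{rr,k} \mathbf{r}_k + \mathbf{b}_{r,k}^\top \mathbf{r}_k + h_k
   - \tfrac{1}{4} \left(\mathbf{A}_{tr,k} \mathbf{r}_k + \mathbf{b}_{t,k}\right)^\top
     \mathbf{A}_{tt,k}^{-1}
     \left(\mathbf{A}_{tr,k} \mathbf{r}_k + \mathbf{b}_{t,k}\right).
\end{align*}
```

Expanding the last product and collecting the terms that are quadratic, linear, and constant in $\mathbf{r}_k$ yields the three parameters of the rotation subproblem.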


9.1.2 Riemannian Gradient and Hessian on SO(3) 


To minimize the rotation subproblem (5.25) in Sec. 5.2, the Riemannian
gradient and Hessian are needed. In the following, it is demonstrated
how they can be calculated.


Gradient 


In the so(3)-tangent space, the derivative in direction £ is calculated. With 
R(€) = e«R, r = vec(R) and Z as defined in (2.7), it follows: 


De fR) = Oe fee (R)|._, 7 (9.13) 
ð- fee R) = d.r(&e)" Of (R)|pr(ge) = der(&e)"2(Ar(&e)+b) (9.14) 
= Zvec(|E], ER)" (Ar(€e) + b) . (9.15) 


With £ — 0 it follows: 


De f (R) = 3; fee (R)|__, = 2vec((é],, R)" (Ar + b) (9.16) 
= ((RT ST) vec([&],))" (Ar +b) (9.17) 
29) o¢TZT (R @I) (Ar +b) = ¿Tgrad( f) . (9.18) 
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9.1 Calibration 


Finally, the gradient of the locally parameterized objective function can 
be obtained: 


grad(f) = 2Z" (R 8 I) (Ar +b) . (9.19) 


Hessian 


The second order derivative is calculated similarly to the previous calcu- 
lations: 


De grad( f) = lim 32, f.g (R) , (9.20) 
¿Tð grad( f) = €"9.22" (R(€e) OT) (Ar(Ee) + b) (9.21) 
= ð, (2vec(lel, eR)" (Avec(e*&)R) +b)) . (9.22) 


With e — 0 it follows: 


Dg grad(f) = 2vec([E]? R)" (Ar + b) + 2vec([¢], R)" Avec((é],, R) . 
(9.23) 
With the reshape operator mat(vec(A)) = A, it follows: 


2vec( |]? R)” (Ar + b) = 2€72" ((¢], R 81) (Ar +b) 
© 26T ZT vec(mat(Ar +b) RT [E]?) 
= 2¢7Z? (I @ mat(Ar + b) R) vee([£]) 


x 


C9 _9gTZT (18 mat(Ar +b) RT) Ze 

= T Hess, (f), (9.24) 
avec({é], R)T Avec([E], R) = 2ETZT (R 81) A (R 81)T Z£ 

= ¿T Hess, ( f)E. (9.25) 


Finally, the Hessian of the locally parameterized objective function can 
be obtained: 


Hess( f) = Hess, (f) + Hess, (f) (9.26) 
=2Z" ((R@I)A(R@I)' —1@mat(Ar+b)R")Z. (9.27) 
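Formulas of this kind are easy to get wrong, so a finite-difference check along the curve exp(ε[ξ]×)R is a useful sanity test. The sketch below assumes the quadratic objective f(R) = rᵀAr + bᵀr with symmetric A, a column-major vec-operator, and Z defined by vec([ξ]×) = Zξ; all names are illustrative.

```python
import numpy as np


def skew(v):
    """Cross product operator with skew(v) @ u == np.cross(v, u)."""
    x, y, z = v
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])


def expm_so3(xi):
    """Matrix exponential of skew(xi) via Rodrigues' formula."""
    theta = np.linalg.norm(xi)
    K = skew(xi)
    if theta < 1e-12:
        return np.eye(3) + K
    return (np.eye(3) + np.sin(theta) / theta * K
            + 2.0 * np.sin(theta / 2.0) ** 2 / theta ** 2 * K @ K)


# Z maps xi to vec(skew(xi)) for a column-major vec-operator.
Z = np.column_stack([skew(e).flatten(order="F") for e in np.eye(3)])


def vec(M):
    return M.flatten(order="F")


def mat(v):
    return v.reshape(3, 3, order="F")


def objective(R, A, b):
    r = vec(R)
    return float(r @ A @ r + b @ r)


def riemannian_grad(R, A, b):
    """Gradient with respect to xi for the update exp(skew(xi)) @ R."""
    g = 2.0 * A @ vec(R) + b
    return Z.T @ np.kron(R, np.eye(3)) @ g


def riemannian_hess(R, A, b):
    """Matrix H such that xi @ H @ xi is the second derivative along xi."""
    g = 2.0 * A @ vec(R) + b
    RI = np.kron(R, np.eye(3))
    return Z.T @ (2.0 * RI @ A @ RI.T
                  - np.kron(np.eye(3), mat(g) @ R.T)) @ Z


if __name__ == "__main__":
    rng = np.random.default_rng(2)
    B = rng.normal(size=(9, 9))
    A, b = B + B.T, rng.normal(size=9)
    R, xi = expm_so3(rng.normal(size=3)), rng.normal(size=3)
    eps = 1e-5
    numeric = (objective(expm_so3(eps * xi) @ R, A, b)
               - objective(expm_so3(-eps * xi) @ R, A, b)) / (2 * eps)
    print(numeric, xi @ riemannian_grad(R, A, b))
```

Only the quadratic form ξᵀ Hess(f) ξ is checked by such a test; if a symmetric Hessian is required, the matrix can be symmetrized without changing this quadratic form.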



9.1.3 Proofs 
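Invertibility of $\mathbf{A}_{tt,k}$


Calculating the translation vector from the rotation in Sec. 5.2.5 requires
the matrix $\mathbf{A}_{tt,k}$ to be invertible. Here, it is shown that $\mathbf{A}_{tt,k}$ is positive
definite in most cases and thus invertible. It needs to be shown:

\[
\mathbf{x}^\top \mathbf{A}_{tt,k} \mathbf{x} > 0 \quad \Longrightarrow \quad \mathbf{A}_{tt,k} \text{ is invertible.} \tag{9.28}
\]

With $\|\mathbf{d}_i\| = 1$, $w_{ik} > 0$ and $\forall \mathbf{x} \in \mathbb{R}^3$ with $\|\mathbf{x}\| > 0$ it follows:

\begin{align*}
\mathbf{x}^\top \mathbf{A}_{tt,k} \mathbf{x}
  &= \mathbf{x}^\top \Big( \sum_i w_{ik} [\mathbf{d}_i]_\times^\top [\mathbf{d}_i]_\times \Big) \mathbf{x}
   = \sum_i w_{ik} \, \mathbf{x}^\top [\mathbf{d}_i]_\times^\top [\mathbf{d}_i]_\times \mathbf{x} \\
  &= \sum_i w_{ik} \left([\mathbf{d}_i]_\times \mathbf{x}\right)^\top \left([\mathbf{d}_i]_\times \mathbf{x}\right)
   = \sum_i w_{ik} \left\| [\mathbf{d}_i]_\times \mathbf{x} \right\|^2 \\
  &= \sum_i w_{ik} \left\| \mathbf{x} \times \mathbf{d}_i \right\|^2 > 0.
\end{align*}

This is always true, except for the degenerate case of parallel rays, e.g.,
orthographic projection or telecentric optics. Then $\mathbf{x} = s \mathbf{d}_i, \ \forall i$ with some
arbitrary scalar $s$ results in $\mathbf{x}^\top \mathbf{A}_{tt,k} \mathbf{x} = 0$. In this case, there is an
ambiguity in the translation term, because it is not possible to estimate the
distance between the calibration pattern and a camera with orthographic
projection:

\[
\mathbf{t} = \mathbf{t}_0 + s \mathbf{d}_0. \tag{9.29}
\]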

Convergence of AM-Calibration 


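Following the research in the field of alternating minimization [65, 139],
the following proves that the proposed alternating minimization technique
for camera calibration is convergent. Thus

\[
f\!\left(\mathcal{P}^{(n+1)}, \mathcal{L}^{(n+1)}\right) \le f\!\left(\mathcal{P}^{(n)}, \mathcal{L}^{(n)}\right) \tag{9.30}
\]

needs to be shown, where $\mathcal{L}^{(n)}$ is the set of ray parameters and
$\mathcal{P}^{(n)} = [\mathcal{R}^{(n)}, \mathcal{T}^{(n)}]$ the set of pose parameters, consisting of rotations and
translations, in iteration $n$.

Define the operators $S_{\mathcal{L}}$ and $S_{\mathcal{P}}$ as the solution to the ray subproblem of
Sec. 5.2.4 and as the solution to the pose subproblem of Sec. 5.2.5, respectively:

\begin{align}
S_{\mathcal{L}}\!\left\{ f\!\left(\mathcal{P}^{(n)}, \mathcal{L}^{(n)}\right) \right\} &= f\!\left(\mathcal{P}^{(n)}, \mathcal{L}^{(n+1)}\right), \tag{9.31} \\
S_{\mathcal{P}}\!\left\{ f\!\left(\mathcal{P}^{(n)}, \mathcal{L}^{(n+1)}\right) \right\} &= f\!\left(\mathcal{P}^{(n+1)}, \mathcal{L}^{(n+1)}\right). \tag{9.32}
\end{align}

Because the optimization of camera rays delivers an optimal solution to
its subproblem, the objective function cannot increase:

\[
S_{\mathcal{L}}\!\left\{ f\!\left(\mathcal{P}^{(n)}, \mathcal{L}^{(n)}\right) \right\} \le f\!\left(\mathcal{P}^{(n)}, \mathcal{L}^{(n)}\right). \tag{9.33}
\]

Furthermore, if the Newton descent algorithm for pose estimation is
initialized with the previous pose, the objective function value always
descends:

\[
S_{\mathcal{P}}\!\left\{ f\!\left(\mathcal{P}^{(n)}, \mathcal{L}^{(n+1)}\right) \right\} \le f\!\left(\mathcal{P}^{(n)}, \mathcal{L}^{(n+1)}\right). \tag{9.34}
\]

In conclusion, it follows:

\begin{align}
f\!\left(\mathcal{P}^{(n+1)}, \mathcal{L}^{(n+1)}\right) &= S_{\mathcal{P}}\!\left\{ f\!\left(\mathcal{P}^{(n)}, \mathcal{L}^{(n+1)}\right) \right\} \notag \\
&\le f\!\left(\mathcal{P}^{(n)}, \mathcal{L}^{(n+1)}\right) \notag \\
&= S_{\mathcal{L}}\!\left\{ f\!\left(\mathcal{P}^{(n)}, \mathcal{L}^{(n)}\right) \right\} \notag \\
&\le f\!\left(\mathcal{P}^{(n)}, \mathcal{L}^{(n)}\right), \tag{9.35} \\
\Longrightarrow \quad f\!\left(\mathcal{P}^{(n+1)}, \mathcal{L}^{(n+1)}\right) &\le f\!\left(\mathcal{P}^{(n)}, \mathcal{L}^{(n)}\right). \tag{9.36}
\end{align}

Since the objective is a weighted sum of squared norms and therefore bounded
from below by zero, the non-increasing sequence of objective values
converges, q.e.d.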
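The argument can be illustrated with a toy problem that has the same structure: one block is solved exactly and the other is only improved by a descent method started from its previous value. The rank-1 factorization below is a stand-in chosen for brevity, not the camera calibration itself, and all names are hypothetical.

```python
import numpy as np


def objective(Y, p, l):
    """Squared Frobenius error of the rank-1 model Y ~ p l^T."""
    return float(np.sum((Y - np.outer(p, l)) ** 2))


def solve_l_exact(Y, p):
    """Exact minimizer over l for fixed p (plays the role of the ray step)."""
    return Y.T @ p / (p @ p)


def descend_p(Y, p, l, steps=3):
    """A few backtracking gradient steps over p, started at the previous p
    (plays the role of the pose step); a step is accepted only if the
    objective does not increase."""
    for _ in range(steps):
        gradient = -2.0 * (Y - np.outer(p, l)) @ l
        current = objective(Y, p, l)
        step = 1.0
        while objective(Y, p - step * gradient, l) > current and step > 1e-12:
            step *= 0.5
        if objective(Y, p - step * gradient, l) <= current:
            p = p - step * gradient
    return p


def alternate(Y, p0, l0, iterations=20):
    """Alternating minimization; returns the objective after each iteration."""
    p, l = p0.copy(), l0.copy()
    history = [objective(Y, p, l)]
    for _ in range(iterations):
        l = solve_l_exact(Y, p)    # cannot increase the objective
        p = descend_p(Y, p, l)     # cannot increase the objective
        history.append(objective(Y, p, l))
    return history


if __name__ == "__main__":
    rng = np.random.default_rng(3)
    Y = np.outer(rng.normal(size=6), rng.normal(size=5))
    Y += 0.01 * rng.normal(size=Y.shape)
    for value in alternate(Y, rng.normal(size=6), rng.normal(size=5), 10):
        print(f"{value:.6f}")
```

The printed values form a non-increasing sequence that is bounded below by zero, which mirrors the two inequalities used in the proof.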